Presentation on theme: "Social network partition Presenter: Xiaofei Cao Partick Berg."— Presentation transcript:
Social network partition Presenter: Xiaofei Cao Partick Berg
Problem Statement Say we have a graph of nodes, representing anything you can imagine, but for our purposes let’s say it represents the population spread out across a country. Some of these nodes are closely batched together, representing a “community”, while others are further away representing a different “community”. We want to detect the communities in this graph. So how do we do this?
Some Definitions Before we try and develop a solution to this problem, we should get a few common definitions out of the way first. Degrees – A degree is the number of edges connected to a node. Community – A community is a grouped together (by some similarity) set of nodes that are densely connected internally.
A C B D G FE H Degree of C is 3 Degree of D is 2
Why find Communities? We want to find communities to see the relations between groups and their connections to others. We can use this to find groups of people that share particular traits easily, such as terrorist organizations (or any other social network).
How do we find Communities? Vertex Betweenness – This is a measure of a vertex (or node’s) centrality within the graph. This quantifies the number of times a node acts as a bridge in a shortest path between two other nodes.
Use BFS to find shortest-paths We use the BFS (Breadth First Search) Algorithm to find the shortest paths between each node and every other node. From this we can calculate the vertex betweenness for each node.
Girvan Newman Algorithm We can use the Girvan Newman algorithm to detect communities in the graph. Girvan Newman takes the “Betweenness” score and extends the definition to edges. So an edge “Betweenness” score is the number of shortest paths between a pair of nodes that runs along it. If there are more than one shortest paths, each path is assigned a value such that all paths have equal value.
Girvan Newman Algorithm Continued We can see that by using this method of edge “betweenness” scoring that communities will have lower edge scores between nodes in their community and higher edge scores along edges that connect them to other communities. To find the community, we now remove the highest scoring edge and re-calculate the “betweenness” score for each of the affected edges.
Example A C D B The highest edge score is 6, connecting node A to node C. So we remove this edge first.
Girvan Newman Algorithm Continued Now we continue to remove each highest score edge from the graph and recalculate until no edges remain. The end result is a dendrogram that shows the clusters of communities in our graph.
Proposed by Girvan-Newman in paper: "Community structure in social and biological networks." Proceedings of the National Academy of Sciences 99.12 (2002): 7821-7826. Complete algorithm in paper: "Finding and evaluating community structure in networks." Physical review E 69.2 (2004): 026113. Sequential Algorithm
Girvan Newman algorithm Goal: find the edge with the highest betweenness score and remove it. Continue doing that until the graph been partitioned. Import: The graph for every iteration. (adjacency matrix) Output: The betweenness score for every edges. (Betweenness matrix) The algorithm can be separate into 2 parts.
Part I: Find the number of shortest path from one node to every other nodes From top to down. Using breadth first algorithm to generate a new view for that node. Find the number of shortest path.
Part II Calculate the edges betweenness score for every iteration From bottom to up. Every nodes contain one score. Every edges’ score equal to Node_score/#shortest_path*(# of shortest path to the upper layer nodes) Sum up edges’ scores for every iteration.
Analysis the time complex Number of iteration in the big loop: n (number of nodes) Time complex of finding the shortest path: O(n^2) Time complex of calculating the betweenness score: O(n) Adding the betweenness matrix: n^2 Time complex is: n*(n^2+n+n^2)=O(n^3);
Parallel algorithm (Intuitively) Assigned every processor the same adjacency matrix of the original network. They start from different nodes. Generating views and calculating the betweenness matrix for each starting nodes. Then sum the matrix locally first. Doing prefix sum and update the original network by remove the highest score edges.
P1P2P4P3P5P6P8P7 G1,G2,G3G4,G5,G6G7,G8,G9G10,G11,G12G13,G14,G15G16,G17,G18G19,G20,G21G22,G23,G24 Breath first algorithm V1, V2, V3 V4, V5, V6 V7, V8, V9 V10, V11, V12 V13, V14, V15 V16, V17, V18 V19, V20, V21 V22, V23, V24 Sum the between -ness score locally B1B2B4B3B5B6B8B7 Parallel Prefix Sum B1B2B4B3B5B6B8B7 B1B2B4B3B5B6B8B7 B1B2B4B3B5B6B8B7 B1B2B4B3B5B6B8B7 Use B8 Value to update network Gn: start from node n in graph
Analysis of time complex Number of iteration: n/p; Find the number of shortest path: O(n^2); Find the betweenness score: O(n); Adding betweenness score locally: O(n^2); Adding betweenness score globally(prefix sum): O(n^2*log(p)) Time complex: n/p*(n^2+n+n^2)+n^2*log(p) =n^2(n/p+log(p));
Continue Speed up: n/(n/p+log(P)) When n=p*log(p); speed up = p; It is cost optimal.