
Finding dense components in weighted graphs Paul Horn 12-2-02.


1 Finding dense components in weighted graphs Paul Horn 12-2-02

2 Overview
Addressing the problem
- What is the problem
- How it differs from other already-solved problems
Building a solution
- Already existing research
- Preliminary work
- Final solution

3 Overview: The Sequel
Analysis
- Testing
- Effectiveness
- Time complexity
Future work
- Trimming the data set more
- Linking it with real data

4 The problem
To find dense subgraphs of a graph:
- Not just the densest one
- Not necessarily all, but as many as possible of the subgraphs that are 'dense enough'
The idea is to identify communities based on a communications network: the denser the communication within a subgraph, the more likely it is a community.

5 Why is it hard?
- The fastest flow-based methods for finding the single densest subgraph are cubic or worse, and we want more than one dense subgraph.
- The greedy approximation algorithm is destructive and thus returns only one subgraph.
- The problem becomes harder when we allow subgraphs to overlap.

6 Weighty Ideas
- Input graphs to the algorithm are weighted.
- The weight of an edge represents the intensity of a communication: its duration and frequency.
- This requires a new definition of density.

7 How dense can it get?
Recall our old definition of density for an unweighted graph G = (V, E):
    d(G) = |E| / |V|
We modify it to give a notion of density for a weighted graph, replacing the edge count with the total edge weight:
    d_w(G) = (sum over e in E of w(e)) / |V|
Note that if the weight of every edge is one, the two definitions coincide.
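The two definitions above can be sketched directly. This is a minimal illustration, assuming a graph stored as an adjacency map `{vertex: {neighbor: weight}}` (a representation chosen here for convenience; the talk does not specify one):

```python
def density(graph):
    """Unweighted density: |E| / |V|.

    `graph` maps each vertex to a {neighbor: weight} dict; each
    undirected edge appears under both endpoints, so we halve the sum.
    """
    edges = sum(len(nbrs) for nbrs in graph.values()) / 2
    return edges / len(graph)

def weighted_density(graph):
    """Weighted density: total edge weight / |V|."""
    total = sum(sum(nbrs.values()) for nbrs in graph.values()) / 2
    return total / len(graph)

# A triangle with all weights equal to one: the definitions agree.
g = {"a": {"b": 1, "c": 1},
     "b": {"a": 1, "c": 1},
     "c": {"a": 1, "b": 1}}
```

On this triangle both functions return 3 edges (or total weight 3) divided by 3 vertices, so the unit-weight case reduces to the unweighted definition, as the slide notes.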

8 Done before?
Discussed in the Charikar paper presentation:
- Goldberg, A.V., Finding a Maximum Density Subgraph: a flow-based exact maximum-density subgraph algorithm.
- Charikar, Greedy Approximation Algorithms for Finding Dense Components in a Graph: a linear-time approximation algorithm.
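For reference, Charikar's greedy algorithm repeatedly deletes a minimum-degree vertex and remembers the densest intermediate subgraph, which yields a 2-approximation to the maximum density. A simple sketch (using a naive O(|V|^2) minimum search rather than the linear-time bookkeeping of the paper) might look like:

```python
def charikar_greedy(graph):
    """Greedy peeling for the densest subgraph on an unweighted graph.

    `graph` maps each vertex to a set of neighbors. Repeatedly remove a
    minimum-degree vertex; track the best density |E|/|V| seen along the
    way. Returns (vertex set, density) of the best intermediate subgraph.
    """
    adj = {v: set(nbrs) for v, nbrs in graph.items()}  # mutable copy
    edges = sum(len(n) for n in adj.values()) // 2
    best_set, best_density = set(adj), edges / len(adj)
    while len(adj) > 1:
        v = min(adj, key=lambda u: len(adj[u]))  # minimum-degree vertex
        edges -= len(adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
        d = edges / len(adj)
        if d > best_density:
            best_set, best_density = set(adj), d
    return best_set, best_density

# K4 (density 6/4 = 1.5) plus a pendant vertex "e" dragging the whole
# graph down to 7/5 = 1.4; peeling strips "e" and recovers the K4.
g = {"a": {"b", "c", "d", "e"},
     "b": {"a", "c", "d"},
     "c": {"a", "b", "d"},
     "d": {"a", "b", "c"},
     "e": {"a"}}
```

The example names and the adjacency representation are illustrative, not from the talk.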

9 Preliminary Work
An implementation of Goldberg's and Charikar's algorithms.
- On test data (generated with a dual-probability Erdős–Rényi model), Charikar's algorithm identified a subgraph close to the actual densest subgraph.
- These test graphs, however, were unweighted, and each contained only one dense subgraph, so they did not exercise the weighted requirement.

10 A First Attempt
A modification of Charikar's algorithm for weighted graphs:
- At each step, remove a random edge of lowest weight, then find all connected components.
- Recurse down on each component, and return the maximum-density subgraph found.
By repeated executions of the algorithm, the hope is that different, possibly overlapping, dense components will be revealed.
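The two steps above can be sketched as follows. This is a minimal, mutating implementation for small graphs (deep recursion and repeated component scans make it impractical at scale), with the adjacency-map representation and all names being assumptions of this sketch:

```python
import random

def weighted_density(adj):
    """Total edge weight / |V| for an adjacency map {v: {u: w}}."""
    return sum(sum(n.values()) for n in adj.values()) / 2 / len(adj)

def components(adj):
    """Connected components, each returned as its own adjacency map."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(u for u in adj[v] if u not in comp)
        seen |= comp
        comps.append({v: {u: w for u, w in adj[v].items() if u in comp}
                      for v in comp})
    return comps

def densest_by_peeling(adj):
    """Remove a random lowest-weight edge, split into components, and
    recurse; return (vertex set, density) of the densest subgraph seen.
    Mutates `adj`; pass a copy if the original is still needed."""
    best_set, best = set(adj), weighted_density(adj)
    edges = [(w, v, u) for v in adj for u, w in adj[v].items() if v < u]
    if not edges:
        return best_set, best
    w_min = min(e[0] for e in edges)
    _, v, u = random.choice([e for e in edges if e[0] == w_min])
    del adj[v][u]; del adj[u][v]
    for comp in components(adj):
        s, d = densest_by_peeling(comp)
        if d > best:
            best_set, best = s, d
    return best_set, best

# A heavy triangle (weight-3 edges, density 3.0) attached by weight-1
# edges to a sparse tail; peeling the light edges isolates the triangle.
g = {"a": {"b": 3, "c": 3}, "b": {"a": 3, "c": 3},
     "c": {"a": 3, "b": 3, "d": 1}, "d": {"c": 1, "e": 1}, "e": {"d": 1}}
```

Whichever of the two weight-1 edges is removed first, the recursion isolates the triangle, so the result here is deterministic despite the random tie-break.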

11 Seems Promising, but…
- On test cases generated similarly to those used to test Charikar's and Goldberg's algorithms, the modification successfully identified close to, if not exactly, the dense portions.
- On simulated communication-network data, however, the graph was dense enough that large areas of it were denser than the small dense portions, which therefore went unfound.

12 Partitioning?
- By partitioning optimally, i.e. finding a cut of minimum weight, we can increase the density of the graph (to some extent): since we cut edges of low weight, the edges of high weight remain within each partition.
- This (obviously) doesn't work forever, but knowing approximately what size we want, we can find good candidates.
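A tiny worked example of the claim above: cutting a single low-weight bridge can leave a side whose density exceeds that of the whole graph. The graph and names here are illustrative, not from the talk:

```python
def weighted_density(adj, verts):
    """Total edge weight inside `verts`, divided by |verts|."""
    w = sum(wt for v in verts for u, wt in adj[v].items() if u in verts) / 2
    return w / len(verts)

# A heavy triangle (weight-3 edges) joined to a sparse weight-1 path
# by a single weight-0.5 bridge (c-x): the minimum-weight cut.
g = {"a": {"b": 3, "c": 3},
     "b": {"a": 3, "c": 3},
     "c": {"a": 3, "b": 3, "x": 0.5},
     "x": {"c": 0.5, "y": 1},
     "y": {"x": 1, "z": 1},
     "z": {"y": 1}}

whole = weighted_density(g, set(g))           # 11.5 / 6, about 1.92
left = weighted_density(g, {"a", "b", "c"})   # 9 / 3 = 3.0
```

Cutting the bridge costs only 0.5 in weight but separates the heavy triangle, whose density (3.0) is well above the whole graph's (about 1.92), which is exactly why the talk cuts edges of low weight.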

13 Rethinking our algorithm
A partitioning-based algorithm:
- Uses Kernighan–Lin to find close-to-optimal partitions.
- Recurses down on the partitions until they are of the desired size.
- The densest of the remaining partitions are our output.

14 Finalizing our thought
Run the algorithm on more than one partition; random partitions are likely to be close to orthogonal.
- At the top level, generate k partitions and keep the best l (after Kernighan–Lin is applied).
- At each lower level, generate k partitions and keep only the top one.
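The recursive scheme of the last two slides can be sketched as below. This is a simplified stand-in, not the talk's implementation: `kl_refine` is a greedy pair-swap pass in the spirit of Kernighan–Lin (not the full algorithm), and only the "k random partitions, keep the best" step is shown (effectively l = 1 at every level):

```python
import random

def cut_weight(adj, a, b):
    """Total weight of edges crossing between vertex sets a and b."""
    return sum(w for v in a for u, w in adj[v].items() if u in b)

def kl_refine(adj, a, b, rounds=20):
    """Greedy swap refinement: repeatedly swap the pair (v, u) with the
    largest positive gain D(v) + D(u) - 2*w(v, u), where D is a vertex's
    external minus internal weight (the classic Kernighan-Lin gain)."""
    a, b = set(a), set(b)
    for _ in range(rounds):
        def d(v, own, other):
            return (sum(w for u, w in adj[v].items() if u in other)
                    - sum(w for u, w in adj[v].items() if u in own))
        gain, v, u = max(((d(x, a, b) + d(y, b, a) - 2 * adj[x].get(y, 0),
                           x, y) for x in a for y in b),
                         default=(0, None, None))
        if gain <= 0:
            break
        a.remove(v); b.remove(u); a.add(u); b.add(v)
    return a, b

def weighted_density(adj, verts):
    w = sum(wt for v in verts for u, wt in adj[v].items() if u in verts) / 2
    return w / len(verts)

def dense_pieces(adj, target, k=8):
    """Try k random balanced bisections, refine each, keep the one of
    minimum cut, and recurse on both sides until pieces have at most
    `target` vertices. Returns the pieces sorted densest first."""
    def best_bisection(verts):
        vs, best_cut, best = list(verts), float("inf"), None
        for _ in range(k):
            random.shuffle(vs)
            a, b = kl_refine(adj, vs[:len(vs) // 2], vs[len(vs) // 2:])
            c = cut_weight(adj, a, b)
            if c < best_cut:
                best_cut, best = c, (a, b)
        return best

    def rec(verts):
        if len(verts) <= target:
            return [set(verts)]
        a, b = rec_split = best_bisection(verts)
        return rec(a) + rec(b)

    return sorted(rec(set(adj)),
                  key=lambda s: weighted_density(adj, s), reverse=True)

# Two triangles (heavier edges on a-b-c) joined by a weight-0.5 bridge.
random.seed(0)
g = {"a": {"b": 2, "c": 2}, "b": {"a": 2, "c": 2},
     "c": {"a": 2, "b": 2, "x": 0.5},
     "x": {"c": 0.5, "y": 1, "z": 1},
     "y": {"x": 1, "z": 1}, "z": {"x": 1, "y": 1}}
```

On this graph the minimum cut is the bridge, so the recursion separates the two triangles and reports the heavier one first.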

15 Analyzing the Situation
- The 2-approximation bound that we had is no longer necessarily valid for the Kernighan–Lin-based algorithm.
- The algorithm has met with some success in identifying clusters in simulated data, but needs more tuning with respect to size and to the trimming of the data set.
- By trimming out small partitions that are found to be similar, we reduce overlap. The algorithm may now find too many, or incorrect, subgraphs, but this problem can be relieved by keeping only the small portions above a certain density (say, some percentage of the densest found).

16 Time it
- The original modification to Charikar runs in approximately O(|V||E|) time.
- The new algorithm runs in approximately O(kl |V|^2 log|V|) time:
- k and l come from generating k partitions at each step and picking the top l.
- |V|^2 is the cost of Kernighan–Lin.
- log|V| comes from continuing to partition recursively.
- In practice it runs very fast: partitioning graphs of 10000+ vertices is possible in a reasonable amount of time.

17 In the future
The algorithm still needs to better trim the partitions it finds, and specifically needs to find partitions of more variable size.
- Could trim based on the density of the entire graph, or on a maximum-density subgraph (as found by the modified Charikar).
- It already finds subgraphs of many sizes, but only considers the smallest at the end; it could be modified to include more of the larger partitions.

18 In the Future II
Future data will not be simulated, but will instead come from online sources.
- Running on a newsgroup-induced graph, for instance, can hopefully help identify groups interested in particular topics.
- Finding dense subgraphs in email data or in portions of the web graph could help identify groups of friends or topic-related sites, and thus help predict communities.

19 So What?
- By looking at not just one graph but a time-based series of graphs, we can identify communities and how they change over time.
- Using this method, we can hope to identify rules that govern the changes of these communities and make predictions about their future actions.
- The simulated data used was designed with this end in mind.

20 Summing Up
- Finding multiple dense subgraphs of a graph is a relatively unexplored topic, especially in large graphs, where exact algorithms are unreasonable.
- Prior work (such as Goldberg's and Charikar's) centered on finding a single densest subgraph.

21 Summing down
- The first algorithm, a modification of Charikar's, centered on removing edges and finding connected components.
- The second algorithm is based on the Kernighan–Lin algorithm for finding near-optimal partitions, recursing down to find small subgraphs that are separated from the rest of the graph by cuts of small weight.

22 The Summing
Still work to do:
- Linking it back to real data: Internet data from newsgroups, email, etc.
- Using that data to find communities over time, and finding microlaws that govern how those communities change.
- Finding better ways to trim the data to ensure that the best candidates are found.

