Discovering Large Dense Subgraphs in Massive Graphs
David Gibson, IBM Almaden Research Center
Ravi Kumar, Yahoo! Research*
Andrew Tomkins, Yahoo! Research*
VLDB, Trondheim, September 1, 2005
* Work performed while at IBM Almaden
© 2005 IBM Corporation

VLDB 2005 Discovering Dense Subgraphs © 2005 IBM Corporation
Slide 3 of 19: Applications
• Web communities
  – 4B Web pages + hyperlinks
• Host collusion
  – 50M Web hosts + intersite links
• Blogging neighbourhoods
  – 4M users + friend links
• Telephone call networks
  – Subscribers + people called
• Email graph
  – Enron employees + correspondents

Slide 4 of 19: Other Approaches
• Trawling for bipartite cores [Kumar et al. 1999]
• Network flow [Flake et al. 2000]
• Peeling [Abello et al. 2002]
• Bursts [Tomkins et al. 2003]
• Why is discovering dense subgraphs hard?
  – Size of locally dense regions is highly variable

Slide 5 of 19: Graphs, cliques, and dense subgraphs
• Our goal: find large, dense subgraphs
• Constraints:
  – Stream processing model
  – Out-of-core sort
[Figure: graph G with subgraphs C1 and C2; labels 60%, 67%, and 100%]

Slide 6 of 19: Shingling
• The text problem: create a document fingerprint that is immune to small changes.
  1. Convert to a set of shingles
  2. Hash each element of the set
  3. Return the minimum hash value
  4. Repeat with different hash functions

  Element                               Hash
  1: "overlapping subsequences of"      23
  2: "subsequences of words"            12   (minimum)
  3: "of words in"                      39
  4: "words in the"                     22
  5: "in the document"                  44
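The four steps above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the names `word_shingles`, `seeded_hash`, and `fingerprint` are hypothetical, the shingle width of three words is inferred from the example table, and BLAKE2b with a per-function salt is an assumed stand-in for the family of hash functions, so the hash values will not match the 23, 12, 39, 22, 44 shown on the slide.

```python
import hashlib


def word_shingles(text, w=3):
    """Return the set of overlapping w-word subsequences of text."""
    words = text.split()
    # A text shorter than w words yields a single (short) shingle.
    count = max(len(words) - w + 1, 1)
    return {" ".join(words[i:i + w]) for i in range(count)}


def seeded_hash(element, seed):
    """64-bit hash of a string; each seed stands in for a different hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(element.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def fingerprint(text, num_hashes=4, w=3):
    """Steps 1-4: shingle the text, hash every shingle, keep one minimum per hash function."""
    shingles = word_shingles(text, w)
    return [min(seeded_hash(s, seed) for s in shingles) for seed in range(num_hashes)]


if __name__ == "__main__":
    sentence = "overlapping subsequences of words in the document"
    for shingle in sorted(word_shingles(sentence)):
        print(shingle)
    print(fingerprint(sentence))
```

Run on the sentence "overlapping subsequences of words in the document", `word_shingles` produces the same five elements listed in the table. Two documents that share most of their shingles will tend to share most of their minimum hash values.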

Slide 7 of 19: Shingling II
• Shingling general sets
  – Jaccard similarity between sets A and B: J(A,B) = |A ∩ B| / |A ∪ B|
  – P[shingle matches] = J(A,B)
• Parameters
  – Pick c shingles to improve the estimate
  – Pick s = size of shingles for stricter matching
[Figure: sets A and B]
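A small experiment can make the roles of c and s concrete. The sketch below assumes that a size-s shingle is the tuple of the s elements with the smallest hash values under one hash function, and that c counts how many hash functions are used; the slide does not spell out this construction, so treat it as an assumption. The helper names (`seeded_hash`, `shingle`, `match_rate`) are hypothetical, and both sets are assumed non-empty.

```python
import hashlib


def seeded_hash(element, seed):
    """64-bit hash of any element; each seed stands in for a different hash function."""
    salt = seed.to_bytes(8, "little")
    data = repr(element).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8, salt=salt).digest(), "big")


def shingle(elements, s, seed):
    """The s elements with the smallest hash values under hash function `seed`."""
    return tuple(sorted(elements, key=lambda e: seeded_hash(e, seed))[:s])


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two non-empty sets."""
    return len(a & b) / len(a | b)


def match_rate(a, b, s=1, c=500):
    """Fraction of the c hash functions under which the size-s shingles of a and b agree."""
    return sum(shingle(a, s, k) == shingle(b, s, k) for k in range(c)) / c


if __name__ == "__main__":
    a, b = set(range(0, 30)), set(range(10, 40))
    print("exact J(A,B):       ", jaccard(a, b))
    print("match rate, s = 1:  ", match_rate(a, b, s=1))
    print("match rate, s = 2:  ", match_rate(a, b, s=2))
```

With s = 1 the match rate is an estimate of J(A,B), and a larger c tightens that estimate. With s = 2 a match requires the two smallest elements to agree, so it can never occur more often than a match with s = 1.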

Slide 8 of 19: Algorithm
• Edge table representation:
    v1 → w1 w2
    v2 → w2 w3
    v3 → w4 w5
• UNION-FIND identifies clusters
  – Scan edge table once
  – O(log n) memory is possible
• Need to find dense clusters of similar edge lists
• Use shingles to compare edge lists
  – And reduce data volume
[Figure: spectrum from UNION-FIND (too lenient) to exact match (too strict)]
[Figure: bipartite graph on v1, v2, v3 and w1, w2, w3, w4, w5]
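To see why UNION-FIND alone is too lenient, it helps to run it on the three-row edge table above. The following is a plain in-memory union-find, written for illustration; it keeps a parent pointer per vertex and is not the low-memory streaming variant the slide alludes to. The function names are hypothetical.

```python
from collections import defaultdict


def find(parent, x):
    """Return the representative of x, compressing the path along the way."""
    parent.setdefault(x, x)
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def union(parent, a, b):
    """Merge the clusters containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[ra] = rb


def connected_components(edge_table):
    """Scan the edge table once and return the resulting clusters."""
    parent = {}
    for v, targets in edge_table.items():
        find(parent, v)  # register v even if it has no outlinks
        for w in targets:
            union(parent, v, w)
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(parent, x)].add(x)
    return list(groups.values())


if __name__ == "__main__":
    edges = {"v1": ["w1", "w2"], "v2": ["w2", "w3"], "v3": ["w4", "w5"]}
    for cluster in connected_components(edges):
        print(sorted(cluster))
```

On this table, v1 and v2 end up in one cluster because they share the single neighbour w2, even though their edge lists overlap in only one of three distinct targets. Exact matching of edge lists would instead keep all three rows apart.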

Slide 9 of 19: Algorithm
• Shingle outlink sets:
    v → w1 … wN   becomes   v → s1 … sC
• Transpose to find sets of v's:
    s1 → v1 v2 …
    s2 → v1 v3 …
• Could run UnionFind now
• Or, reduce graph again!
• Reduces data volume
• Finds dense clusters of v's
[Figure: vertex sets V, W, and S, with labels N, C, and Shingle]

Slide 11 of 19: Algorithm
RecursiveShingle( E )
  Shingle:    S[v] = Shingle( E[v] ) for v in V
  Transpose:  E'[s] = { v | s in S[v] }
  Recurse:    clusters = RecursiveShingle( E' )
    base:     clusters = UnionFind( E' )
  Map back:   return { ∪(v in C) E[v] | C in clusters }
[Figure: E0, E1, E2]
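The pseudocode can be turned into a small runnable sketch. Several details below are assumptions rather than anything stated on the slide: a shingle is taken to be the s elements with the smallest hash values under one of c hash functions, sets with fewer than s elements are dropped, the recursion depth is an explicit `depth` parameter, and everything is held in memory. All function names are hypothetical.

```python
import hashlib


def seeded_hash(element, seed):
    """64-bit hash of any element; each seed stands in for a different hash function."""
    salt = seed.to_bytes(8, "little")
    data = repr(element).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8, salt=salt).digest(), "big")


def shingles(elements, s, c):
    """Up to c shingles, each the s smallest elements under one hash function."""
    if len(elements) < s:
        return set()  # assumption: sets too small to shingle are dropped
    return {
        tuple(sorted(sorted(elements, key=lambda e: seeded_hash(e, seed))[:s]))
        for seed in range(c)
    }


def transpose(S):
    """E'[s] = { v | s in S[v] }."""
    E1 = {}
    for v, shs in S.items():
        for sh in shs:
            E1.setdefault(sh, set()).add(v)
    return E1


def union_find_clusters(E1):
    """Merge all vertices that share a key of E1; return the resulting vertex sets."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in E1.values():
        members = list(members)
        for m in members:
            parent[find(m)] = find(members[0])
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


def recursive_shingle(E, depth=2, s=2, c=4):
    S = {v: shingles(E[v], s, c) for v in E}                # Shingle
    E1 = transpose(S)                                       # Transpose
    if depth <= 1:
        clusters = union_find_clusters(E1)                  # base
    else:
        clusters = recursive_shingle(E1, depth - 1, s, c)   # Recurse
    return [set().union(*(E[v] for v in C)) for C in clusters]  # Map back


if __name__ == "__main__":
    E = {f"v{i}": {"a", "b", "c", "d"} for i in range(3)}
    E.update({f"u{i}": {"w", "x", "y", "z"} for i in range(3)})
    print(recursive_shingle(E))
```

In each call the clusters are sets of keys of E, so the map-back step returns sets of the vertices those keys point to. The toy input contains two groups of sources with identical outlink sets, and the sketch recovers the two target sets {a, b, c, d} and {w, x, y, z}.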

Slide 12 of 19: Data Stream Processing
RecursiveShingle( E )
  Shingle:    linear scan of E
  Transpose:  sort of size |E'|
  Recurse:    (2 or 3 times); UnionFind is linear
  Map back:   linear scan of clusters and E
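The transpose step is the only one that needs a sort, and the shape of that computation is easy to show. In the sketch below, a linear scan emits (shingle, vertex) pairs, the pairs are sorted, and a second linear scan groups them by shingle. Python's in-memory `sorted` is used purely as a stand-in for the out-of-core sort that a massive graph would need; the function name `transpose_by_sort` is hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def transpose_by_sort(S):
    """Yield (shingle, [vertices]) rows of E' from a table S of vertex -> shingles."""
    # Linear scan of S: emit one (shingle, vertex) pair per table entry.
    pairs = ((sh, v) for v, shs in S.items() for sh in shs)
    # Stand-in for an external sort: after sorting, equal shingles are adjacent.
    ordered = sorted(pairs)
    # Linear scan of the sorted pairs: each run of equal shingles is one row of E'.
    for sh, group in groupby(ordered, key=itemgetter(0)):
        yield sh, [v for _, v in group]


if __name__ == "__main__":
    S = {"v1": ["s1", "s2"], "v2": ["s1"], "v3": ["s2"]}
    for shingle, vertices in transpose_by_sort(S):
        print(shingle, "->", " ".join(vertices))
```

Because the grouping pass only ever looks at adjacent rows, it needs no random access to the data, which is what makes the step compatible with a stream processing model.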

Slide 13 of 19: Data Set: The Web Host Graph
• 2.1 billion pages in the WebFountain store in September 2004
• Site Browser system aggregates site information
  – 50 million hostnames
  – 11 billion host → host links; mean outdegree = 220
• Historical trace, June to September, every two weeks
  – How do large clusters form?

Slide 14 of 19: Test Runs
Column groups: Strict shingles [Vertices (M), Edges (M)] and Nonstrict shingles [Vertices (M), Edges (M)]
[Table: cell boundaries lost in extraction] 05011 000110 GB5011 000 12754205.5 GB9572 500 26098690 MB10001 200 700750
Running time: O(days)

Slide 15 of 19: Link Spam and Search Engines
• Some results
  – Several hundred giant dense subgraphs of at least 10 000 nodes
  – 2000 dense subgraphs of at least 1000 nodes
  – 64 000 dense subgraphs of at least 100 nodes
• Sampling of clusters
  – 88% are clearly spam networks
• Clusters can be used to weight search engine results
  – Easy to integrate into search engine workflow

Slide 18 of 19: Historical study
• Study the growth of inlinks to cluster centers
• 10% growth in 3 months
• Most growth is bursty
[Figure: plot of unique IP address inlinks]

Slide 19 of 19: Summary
• Shingles + Recursion = Large Dense Subgraphs
• Extensions:
  – Undirected graphs, hierarchical decompositions
  – Other application areas, such as blogs
• Data stream algorithms scale well
• Thank you! davgib@us.ibm.com
