Published by Isai Graddick. Modified about 1 year ago.

1
Discovering Large Dense Subgraphs in Massive Graphs
David Gibson, IBM Almaden Research Center
Ravi Kumar, Yahoo! Research*
Andrew Tomkins, Yahoo! Research*
VLDB, Trondheim, September 1, 2005
* Work performed while at IBM Almaden
© 2005 IBM Corporation

2
VLDB 2005 Discovering Dense Subgraphs © 2005 IBM Corporation
Agenda
Application Areas
Other Approaches
Shingling
Recursion
Data Set
Performance Results
Evolution studies

3
Applications
Web communities
  – 4B Web pages + hyperlinks
Host collusion
  – 50M Web hosts + intersite links
Blogging neighbourhoods
  – 4M users + friend links
Telephone call networks
  – Subscribers + people called
Email graph
  – Enron employees + correspondents

4
Other Approaches
Trawling for bipartite cores [Kumar et al 1999]
Network flow [Flake et al 2000]
Peeling [Abello et al 2002]
Bursts [Tomkins et al 2003]
Why is discovering dense subgraphs hard?
  – Size of locally dense regions is highly variable

5
Graphs, cliques, and dense subgraphs
Our goal: Find large, dense subgraphs
Constraints:
  – Stream processing model
  – Out-of-core sort
[Figure: graph G with subgraphs C1 and C2; density labels 60%, 67%, 100%]

6
Shingling
The text problem: Create a document fingerprint which is immune to small changes.
1. Convert to a set of shingles
2. Hash each element of the set
3. Return minimum hash value
4. Repeat with different hash functions
Element                                      Hash
Element 1: 'overlapping subsequences of'     23
Element 2: 'subsequences of words'           12   (minimum)
Element 3: 'of words in'                     39
Element 4: 'words in the'                    22
Element 5: 'in the document'                 44
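
Sketch (not from the slides): the four steps above, written out in Python. The function names, the 3-word shingle size, the count of 4 hash functions, and the use of seeded BLAKE2b as the hash family are all assumptions made for illustration, so the hash values will not match the 23/12/39/22/44 shown on the slide.

```python
import hashlib


def word_shingles(text, s=3):
    """Return the set of overlapping s-word subsequences of text."""
    words = text.split()
    return {" ".join(words[i:i + s]) for i in range(len(words) - s + 1)}


def seeded_hash(element, seed):
    """64-bit hash of a string; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        element.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def fingerprint(text, s=3, c=4):
    """Return c minimum hash values, one per hash function (steps 1 to 4)."""
    shingles = word_shingles(text, s)
    if not shingles:
        raise ValueError("text has fewer than s words")
    return [min(seeded_hash(sh, seed) for sh in shingles) for seed in range(c)]
```

Calling `word_shingles("overlapping subsequences of words in the document")` yields the five elements listed on the slide. A small edit to the document changes only a few shingles, so most of the `c` minima are likely to stay the same.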

7
Shingling II
Shingling general sets
  – Jaccard similarity between sets A and B:
  – P[shingle matches] = J(A,B) = |A ∩ B| / |A ∪ B|
Parameters
  – Pick c shingles to improve estimate
  – Pick s = size of shingles for stricter matching
[Figure: sets A and B]
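
Sketch (not from the slides): a small check that the fraction of matching single-element shingles approximates J(A,B). It assumes s = 1, non-empty sets, and seeded BLAKE2b as a stand-in for an idealized random hash family; the function names and the default c = 200 are illustrative choices.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two non-empty sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def seeded_hash(x, seed):
    """64-bit hash of any value with a stable repr; seed selects the function."""
    digest = hashlib.blake2b(
        repr(x).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def min_element(items, seed):
    """The element of items with the smallest hash under one hash function."""
    return min(items, key=lambda x: seeded_hash(x, seed))


def estimate_jaccard(a, b, c=200):
    """Fraction of c hash functions under which A and B pick the same element."""
    matches = sum(min_element(a, seed) == min_element(b, seed) for seed in range(c))
    return matches / c
```

With more hash functions (larger `c`) the estimate tightens around the true value. Taking shingles of size s > 1 makes a match require s shared elements at once, which is the stricter matching the slide refers to.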

8
Algorithm
Edge table representation:
  v1  w1 w2
  v2  w2 w3
  v3  w4 w5
UNION-FIND identifies clusters
  – Scan edge table once
  – O(log n) memory is possible
UNION-FIND: too lenient. Exact-Match: too strict.
Need to find dense clusters of similar edge lists
Use shingles to compare edge lists
  – And reduce data volume
[Figure: vertices v1, v2, v3 linked to w1, w2, w3, w4, w5]
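
Sketch (not from the slides): union-find over the edge table in a single scan. This version keeps the parent map in an in-memory dictionary, so it does not attempt the memory bound mentioned on the slide; the function name and the ("v", ...) / ("w", ...) tagging are assumptions for illustration.

```python
from collections import defaultdict


def union_find_clusters(edge_table):
    """Group source vertices that are connected through any shared destination."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:  # path compression
            parent[x], x = root, parent[x]
        return root

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    # One pass over the edge table; the tags keep v and w names apart.
    for v, destinations in edge_table.items():
        for w in destinations:
            union(("v", v), ("w", w))

    clusters = defaultdict(set)
    for v in edge_table:
        clusters[find(("v", v))].add(v)
    return list(clusters.values())
```

On the slide's edge table this groups v1 with v2 (they share w2) and leaves v3 alone. It also shows why the slide calls the approach too lenient: a chain such as v1: w1 w2, v2: w2 w3, v3: w3 w4 ends up as one cluster even though v1 and v3 share no destination.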

9
Algorithm
Shingle outlink sets:
  v  w1 … wN
  v  s1 … sC
Transpose to find sets of v's:
  s1  v1 v2 …
  s2  v1 v3 …
Could run UnionFind now
Or, reduce graph again!
Reduces data volume
Finds dense clusters of v's
[Figure: V, W, and S, with labels N, C, and Shingle]
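
Sketch (not from the slides): shingling each outlink set and transposing the result. The shingle construction here, taking the s elements with the smallest hash under each of c seeded hash functions, is one common way to build shingles of size s and is an assumption, as are the function names and the BLAKE2b hash family.

```python
import hashlib
import heapq
from collections import defaultdict


def seeded_hash(x, seed):
    """64-bit hash of any value with a stable repr; seed selects the function."""
    digest = hashlib.blake2b(
        repr(x).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def shingle_set(elements, s, c):
    """Return up to c shingles, each a tuple of s elements of the set."""
    elements = set(elements)
    if len(elements) < s:
        return set()  # too small to produce a shingle of size s
    shingles = set()
    for seed in range(c):
        smallest = heapq.nsmallest(s, elements, key=lambda x: seeded_hash(x, seed))
        shingles.add(tuple(sorted(smallest)))
    return shingles


def transpose(shingles_by_vertex):
    """Turn v -> {shingles} into shingle -> {v's that produced it}."""
    vertices_by_shingle = defaultdict(set)
    for v, shingles in shingles_by_vertex.items():
        for shingle in shingles:
            vertices_by_shingle[shingle].add(v)
    return dict(vertices_by_shingle)
```

Vertices with identical outlink sets always produce identical shingles, so after `transpose` they appear together under every one of their shingles. Vertices with similar but not identical outlink sets share some shingles, with a likelihood that grows with their overlap.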

10
Algorithm
0. Base case: UnionFind
1. Shingle
2. Transpose
3. Recurse
4. Map back
[Figure: V, W, V', V'', etc., with edge sets E0, E1, E2 and a Shingle step]

11
Algorithm
RecursiveShingle( E )
  Shingle:    S[v] = Shingle( E[v] ) for v in V
  Transpose:  E'[s] = { v | s in S[v] }
  Recurse:    clusters = RecursiveShingle( E' )
    base:     clusters = UnionFind( E' )
  Map back:   return { ∪ of E[v] over v in C | C in clusters }
[Figure: E0, E1, E2]
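
Sketch (not from the slides): the RecursiveShingle pseudocode above as runnable Python. The shingle construction (s smallest-hash elements per seeded hash function), the default values of s, c, and depth, the explicit depth counter that selects the base case, and all helper names are assumptions for illustration. Everything is held in memory, unlike the stream setting of the talk.

```python
import hashlib
import heapq
from collections import defaultdict


def seeded_hash(x, seed):
    """64-bit hash of any value with a stable repr; seed selects the function."""
    data = repr(x).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8, salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big")


def shingle(elements, s, c):
    """Up to c shingles, each the s smallest-hash elements under one function."""
    elements = set(elements)
    if len(elements) < s:
        return set()
    return {
        tuple(sorted(heapq.nsmallest(s, elements, key=lambda x: seeded_hash(x, seed))))
        for seed in range(c)
    }


def transpose(S):
    """E'[s] = { v | s in S[v] }"""
    E_next = defaultdict(set)
    for v, shingles in S.items():
        for sh in shingles:
            E_next[sh].add(v)
    return dict(E_next)


def union_find(E):
    """Cluster the members of E's value sets: members under one key are merged."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for members in E.values():
        members = list(members)
        if not members:
            continue
        root = find(members[0])
        for other in members[1:]:
            parent[find(other)] = root

    clusters = defaultdict(set)
    for x in list(parent):
        clusters[find(x)].add(x)
    return list(clusters.values())


def recursive_shingle(E, s=2, c=3, depth=2):
    """Return clusters of the elements that appear in E's value sets."""
    S = {v: shingle(E[v], s, c) for v in E}           # Shingle
    E_next = transpose(S)                             # Transpose
    if depth <= 1:
        clusters = union_find(E_next)                 # base case
    else:
        clusters = recursive_shingle(E_next, s, c, depth - 1)  # Recurse
    return [set().union(*(E[v] for v in C)) for C in clusters]  # Map back
```

Each level returns clusters drawn from the value sets of its input, which is what lets the map-back step of the caller look those clusters up as keys of its own `E`. Vertices whose set is smaller than `s` produce no shingles and drop out, which is one way the data volume shrinks from level to level.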

12
Data Stream Processing
RecursiveShingle( E )
  Shingle:    Linear scan of E
  Transpose:  Sort of size |E'|
  Recurse:    (2 or 3 times)
              UnionFind is linear
  Map back:   Linear scan of clusters and E
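
Sketch (not from the slides): why the transpose step is a sort. The shingle scan emits (v, shingle) pairs; sorting them by shingle brings all v's for one shingle next to each other, so one further linear pass can group them. The in-memory `sorted` call stands in for the out-of-core sort of the talk, and the function name is an assumption.

```python
from itertools import groupby
from operator import itemgetter


def transpose_by_sort(pairs):
    """Take (v, shingle) pairs and yield (shingle, [v, ...]) in shingle order."""
    # Flip each pair and sort by shingle; this is the only non-linear step.
    flipped = sorted((shingle, v) for v, shingle in pairs)
    # A single linear pass then collects the run of v's under each shingle.
    for shingle, run in groupby(flipped, key=itemgetter(0)):
        yield shingle, [v for _, v in run]
```

For the pairs (v1, s1), (v1, s2), (v2, s1), (v3, s2) this yields s1 with v1, v2 and s2 with v1, v3, the same layout as the transposed table on the earlier Algorithm slide.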

13
Data Set: The Web Host Graph
2.1 billion pages in the WebFountain store in September 2004
Site Browser system aggregates site information
  – 50 million hostnames
  – 11 billion host-to-host links. Mean outdegree = 220
Historical trace June – September, every two weeks
  – How do large clusters form?

14
Test Runs
Table columns: Strict shingles (Vertices (M), Edges (M)); Nonstrict shingles (Vertices (M), Edges (M))
Table values (cell boundaries lost in extraction): 05011 000110 GB5011 000 12754205.5 GB9572 500 26098690 MB10001 200 700750
Running time: O(days)

15
Link Spam and Search Engines
Some results
  – Several hundred giant dense subgraphs of at least 10 000 nodes
  – 2000 dense subgraphs of at least 1000 nodes
  – 64 000 dense subgraphs of at least 100 nodes
Sampling of clusters
  – 88% are clearly spam networks
Clusters can be used to weight search engine results
  – Easy to integrate into search engine workflow

16
Reduction in outdegree
[Figure: labels 1, 2, 3]

17
Cluster Sizes
[Figure: panels Depth 2 and Depth 3]

18
Historical study
Study the growth of inlinks to cluster centers
10% growth in 3 months.
Most growth is bursty
Unique IP address inlinks

19
Summary
Shingles + Recursion = Large Dense Subgraphs
Extensions:
  – Undirected graphs, hierarchical decompositions
  – Other application areas, such as blogs
Data stream algorithms scale well
Thank you!
davgib@us.ibm.com

21
Example: complete subgraphs
