Discovering Large Dense Subgraphs in Massive Graphs
David Gibson, IBM Almaden Research Center
Ravi Kumar, Yahoo! Research*
Andrew Tomkins, Yahoo! Research*
VLDB, Trondheim, September 1, 2005
* Work performed while at IBM Almaden
© 2005 IBM Corporation

VLDB 2005 Discovering Dense Subgraphs © 2005 IBM Corporation
Slide 2 of 19: Agenda
• Application Areas
• Other Approaches
• Shingling
• Recursion
• Data Set
• Performance
• Results
• Evolution studies

Slide 3 of 19: Applications
• Web communities
  – 4B Web pages + hyperlinks
• Host collusion
  – 50M Web hosts + intersite links
• Blogging neighbourhoods
  – 4M users + friend links
• Telephone call networks
  – Subscribers + people called
• Email graph
  – Enron employees + correspondents

Slide 4 of 19: Other Approaches
• Trawling for bipartite cores [Kumar et al 1999]
• Network flow [Flake et al 2000]
• Peeling [Abello et al 2002]
• Bursts [Tomkins et al 2003]
• Why is discovering dense subgraphs hard?
  – Size of locally dense regions is highly variable

Slide 5 of 19: Graphs, cliques, and dense subgraphs
• Our goal: Find large, dense subgraphs
• Constraints:
  – Stream processing model
  – Out-of-core sort
[Figure: graph G with subgraphs C1 and C2; density labels 60%, 67%, 100%]

Slide 6 of 19: Shingling
• The text problem: Create a document fingerprint which is immune to small changes.
  1. Convert to a set of shingles
  2. Hash each element of the set
  3. Return minimum hash value
  4. Repeat with different hash functions

  Element  Shingle                         Hash
  1        'overlapping subsequences of'   23
  2        'subsequences of words'         12  (minimum)
  3        'of words in'                   39
  4        'words in the'                  22
  5        'in the document'               44
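
The four steps on this slide can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: it assumes word-level shingles of size 3 (matching the slide's example) and uses salted BLAKE2b digests as a stand-in for a family of hash functions. The names `word_shingles`, `salted_hash`, and `fingerprint` are hypothetical.

```python
import hashlib


def word_shingles(text, size=3):
    """Step 1: the set of overlapping word subsequences of length `size`."""
    words = text.split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def salted_hash(element, seed):
    """Step 2: a 64-bit hash of a string; `seed` selects the hash function.

    Assumption: a salted BLAKE2b digest stands in for a random hash family.
    """
    digest = hashlib.blake2b(
        element.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def fingerprint(text, num_hashes=4, size=3):
    """Steps 3 and 4: keep the minimum hash value under each hash function."""
    shingles = word_shingles(text, size)
    if not shingles:
        raise ValueError("text is shorter than one shingle")
    return [
        min(salted_hash(s, seed) for s in shingles)
        for seed in range(num_hashes)
    ]
```

Calling `word_shingles("overlapping subsequences of words in the document")` yields the five elements listed in the table above; the hash values will differ from the slide's illustrative numbers.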

Slide 7 of 19: Shingling II
• Shingling general sets
  – Jaccard similarity between sets A and B:
  – P[shingle matches] = J(A,B) = |A ∩ B| / |A ∪ B|
• Parameters
  – Pick c shingles to improve estimate
  – Pick s = size of shingles for stricter matching
[Figure: overlapping sets A and B]
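
A small sketch of how the two parameters might play out on general sets. It assumes one particular reading of the slide: each of the c shingles is the tuple of the s lowest-ranked elements under its own hash function, so with s = 1 the fraction of matching shingles estimates J(A,B), and larger s requires more elements to agree. The helper names and the BLAKE2b-based ranking are assumptions made for illustration.

```python
import hashlib


def _rank(x, seed):
    """Hypothetical hash family: rank of element x under hash function `seed`."""
    data = repr((seed, x)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def shingle_set(elements, s, c):
    """Return c shingles; shingle j holds the s lowest-ranked elements under hash j."""
    shingles = set()
    for seed in range(c):
        lowest = sorted(elements, key=lambda x: _rank(x, seed))[:s]
        shingles.add((seed, tuple(lowest)))
    return shingles


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)


def estimate_jaccard(a, b, c):
    """Estimate J(A,B) as the fraction of the c size-1 shingles that match."""
    return len(shingle_set(a, 1, c) & shingle_set(b, 1, c)) / c
```

Raising c tightens the estimate, while raising s makes a match rarer unless the two sets overlap heavily, which is the stricter matching the slide refers to.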

Slide 8 of 19: Algorithm
• Edge table representation:
  v1 → w1 w2
  v2 → w2 w3
  v3 → w4 w5
• UNION-FIND identifies clusters
  – Scan edge table once
  – O(log n) memory is possible
• UNION-FIND is too lenient; Exact-Match is too strict
• Need to find dense clusters of similar edge lists
• Use shingles to compare edge lists
  – And reduce data volume
[Figure: vertices v1, v2, v3 linked to w1, w2, w3, w4, w5]
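
To illustrate why plain UNION-FIND is too lenient, here is a minimal sketch that scans the slide's edge table once and merges every vertex with its targets. It is a textbook union-find with path halving, not the authors' code, and it keeps an in-memory parent map rather than attempting the memory bound mentioned on the slide. The function names are hypothetical.

```python
def find(parent, x):
    """Return the representative of x, halving the path along the way."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the groups containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def connected_clusters(edge_table):
    """Single scan of the edge table; returns sorted lists of connected vertices."""
    parent = {}
    for v, targets in edge_table.items():
        find(parent, v)  # register v even if it has no outlinks
        for w in targets:
            union(parent, v, w)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(parent, x), set()).add(x)
    return sorted(sorted(group) for group in groups.values())


# The edge table from the slide.
edge_table = {"v1": ["w1", "w2"], "v2": ["w2", "w3"], "v3": ["w4", "w5"]}
clusters = connected_clusters(edge_table)
```

Here v1 and v2 end up in the same cluster although they share only w2, whereas exact matching of edge lists would keep all three rows apart.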

Slide 9 of 19: Algorithm
• Shingle outlink sets:
  v → w1 … wN  ⇒  v → s1 … sC
• Transpose to find sets of v's:
  s1 → v1 v2 …
  s2 → v1 v3 …
• Could run UnionFind now
• Or, reduce graph again!
  – Reduces data volume
  – Finds dense clusters of v's
[Figure: Shingle step relating V, W, and S; labels N and C]

Slide 10 of 19: Algorithm
0. Base case: UnionFind
1. Shingle
2. Transpose
3. Recurse
4. Map back
[Figure: vertex sets V, W, V', V'', etc., with edge sets E0, E1, E2; Shingle]

Slide 11 of 19: Algorithm
RecursiveShingle( E )
  Shingle:    S[v] = Shingle( E[v] ) for v in V
  Transpose:  E'[s] = { v | s in S[v] }
  Recurse:    clusters = RecursiveShingle( E' )
    base:     clusters = UnionFind( E' )
  Map back:   return { ∪ (v in C) E[v] | C in clusters }
[Figure: edge sets E0, E1, E2]
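
One way to turn the RecursiveShingle pseudocode into runnable code is sketched below. Several details are assumptions rather than anything stated on the slides: the hash family is built from BLAKE2b, a set with fewer than s elements yields no shingles, the recursion depth is an explicit parameter, and the base-case UnionFind merges all vertices that share a shingle. All function names are hypothetical.

```python
import hashlib


def _h(x, seed):
    """Hypothetical hash family: 64-bit hash of x under hash function `seed`."""
    data = repr((seed, x)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def shingle(elements, s, c):
    """Return up to c shingle ids; each hashes the s lowest-ranked elements."""
    if len(elements) < s:
        return set()  # assumption: sets smaller than s yield no shingles
    ids = set()
    for seed in range(c):
        lowest = sorted(elements, key=lambda x: _h(x, seed))[:s]
        ids.add(_h(tuple(lowest), seed))
    return ids


def transpose(S):
    """Invert v -> shingles into shingle -> set of v."""
    E1 = {}
    for v, ids in S.items():
        for sid in ids:
            E1.setdefault(sid, set()).add(v)
    return E1


def union_find_clusters(E1):
    """Base case: group the v's linked through at least one shared shingle."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members in E1.values():
        members = list(members)
        for m in members:
            parent[find(m)] = find(members[0])
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())


def recursive_shingle(E, s=2, c=3, depth=1):
    """E maps each vertex to its set of outlinks; returns merged outlink sets."""
    S = {v: shingle(E[v], s, c) for v in E}                     # Shingle
    E1 = transpose(S)                                           # Transpose
    if depth > 0:
        clusters = recursive_shingle(E1, s, c, depth - 1)       # Recurse
    else:
        clusters = union_find_clusters(E1)                      # base
    return [set().union(*(E[v] for v in C)) for C in clusters]  # Map back
```

With `depth=0` the function shingles once and goes straight to the base case; each extra level shingles the transposed table again, which in this sketch drops vertices whose shingles are shared with too few others.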

Slide 12 of 19: Data Stream Processing
RecursiveShingle( E )
  Shingle:    Linear scan of E
  Transpose:  Sort of size |E'|
  Recurse:    (2 or 3 times)
              UnionFind is linear
  Map back:   Linear scan of clusters and E
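
The transpose step can be expressed as a scan followed by a sort, which is what makes it fit the stream processing model. The sketch below assumes the shingle table arrives as (vertex, shingles) rows and uses Python's in-memory `sorted` as a stand-in for the out-of-core sort; the function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def emit_pairs(shingle_table):
    """Linear scan: yield one (shingle, vertex) pair per table entry."""
    for v, shingles in shingle_table:
        for s in shingles:
            yield (s, v)


def transpose_by_sort(shingle_table):
    """Sort the pairs by shingle, then read off each run as one transposed row."""
    pairs = sorted(emit_pairs(shingle_table))  # stand-in for an external sort
    for s, run in groupby(pairs, key=itemgetter(0)):
        yield s, [v for _, v in run]


# Rows shaped like the earlier slide: v -> s1 ... sC.
table = [("v1", ["s1", "s2"]), ("v2", ["s1"]), ("v3", ["s2"])]
transposed = list(transpose_by_sort(table))
```

After the sort, every shingle's vertices sit next to each other, so the transposed rows can be written out in a single sequential pass.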

Slide 13 of 19: Data Set: The Web Host Graph
• 2.1 billion pages in the WebFountain store in September 2004
• Site Browser system aggregates site information
  – 50 million hostnames
  – 11 billion host → host links. Mean outdegree = 220
• Historical trace June – September, every two weeks
  – How do large clusters form?

Slide 14 of 19: Test Runs
[Table: Vertices (M) and Edges (M) under strict shingles and nonstrict shingles; data sizes in GB, GB, MB]
Running time: O(days)

Slide 15 of 19: Link Spam and Search Engines
• Some results
  – Several hundred giant dense subgraphs of at least nodes
  – 2000 dense subgraphs of at least 1000 nodes
  – dense subgraphs of at least 100 nodes
• Sampling of clusters
  – 88% are clearly spam networks
• Clusters can be used to weight search engine results
  – Easy to integrate into search engine workflow

Slide 16 of 19: Reduction in outdegree
[Figure: reduction in outdegree; series labeled 1, 2, 3]

Slide 17 of 19: Cluster Sizes
[Figure: cluster sizes; panels Depth 2 and Depth 3]

Slide 18 of 19: Historical study
• Study the growth of inlinks to cluster centers
• 10% growth in 3 months. Most growth is bursty
[Figure: axis label "Unique IP address inlinks"]

Slide 19 of 19: Summary
• Shingles + Recursion = Large Dense Subgraphs
• Extensions:
  – Undirected graphs, hierarchical decompositions
  – Other application areas, such as blogs
• Data stream algorithms scale well
• Thank you!

Slide 21 of 19: Example: complete subgraphs