
Jure Leskovec (jure@cs.stanford.edu), Computer Science Department, Cornell University / Stanford University. Joint work with: Eric Horvitz, Michael Mahoney, Kevin Lang, Anirban Dasgupta

• Large on-line computing applications have detailed records of human activity:
  ▪ On-line communities: Facebook (120 million)
  ▪ Communication: Instant Messenger (~1 billion)
  ▪ News and social media: Blogging (250 million)
• We model the data as a network (an interaction graph)
• Can observe and study phenomena at scales not possible before
[Figure: Communication network]

• The Small-world experiment:
  ▪ On a 240 million node communication network of Microsoft Instant Messenger
• Small vs. large networks:
  ▪ Modeling community (cluster) structure of large networks
[Figure: Zachary’s karate club (N=34); tiny part of a large social network]

• How community-like is a set of nodes?
• Idea: Use approximation algorithms for NP-hard graph partitioning problems as experimental probes of network structure.
• Conductance (normalized cut): Φ(S) = # edges cut / # edges inside
• Small Φ(S) corresponds to more community-like sets of nodes
[Figure: a node set S and its complement S’]
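The score on this slide can be computed directly from an edge list. The sketch below follows the slide's wording literally (edges cut divided by edges inside); the function name `community_score` and the toy graph are hypothetical, chosen only for illustration. Note that the standard definition of conductance divides the cut by the volume (sum of degrees) of the smaller side, so the values here match the slide's simplified score, not that definition.

```python
def community_score(edges, S):
    """Slide's score: (# edges cut) / (# edges inside S).

    edges: iterable of undirected (u, v) pairs.
    S: collection of nodes forming the candidate community.
    Returns infinity when S has no internal edges.
    """
    S = set(S)
    cut = 0
    inside = 0
    for u, v in edges:
        u_in, v_in = u in S, v in S
        if u_in and v_in:
            inside += 1      # both endpoints in S
        elif u_in or v_in:
            cut += 1         # exactly one endpoint in S
    if inside == 0:
        return float("inf")
    return cut / inside


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
triangle_score = community_score(toy_edges, {0, 1, 2})  # 1 cut edge, 3 inside
pair_score = community_score(toy_edges, {0, 1})         # 2 cut edges, 1 inside
```

A lower score means fewer edges leave the set relative to the edges it contains, which is the sense in which the slide calls a set more community-like.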

Score: Φ(S) = # edges cut / # edges inside. What is the “best” community of 5 nodes?

Score: Φ(S) = # edges cut / # edges inside. What is the “best” community of 5 nodes?
Bad community: Φ = 5/6 = 0.83

Score: Φ(S) = # edges cut / # edges inside. What is the “best” community of 5 nodes?
Bad community: Φ = 5/7 = 0.7
Better community: Φ = 2/5 = 0.4

Score: Φ(S) = # edges cut / # edges inside. What is the “best” community of 5 nodes?
Bad community: Φ = 5/7 = 0.7
Better community: Φ = 2/5 = 0.4
Best community: Φ = 2/8 = 0.25

• We define: Network community profile (NCP) plot
• Plot the score of the best community of size k
[Figure: NCP plot; x-axis: community size, log k; y-axis: log Φ(k); marked points: k=5 with Φ(5)=0.25 and k=7 with Φ(7)=0.18]
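In symbols, the profile is Φ(k) = min over all node sets S with |S| = k of Φ(S). For a tiny graph this can be evaluated by brute force, as in the sketch below. This is only an illustration: the names `score` and `ncp` and the toy graph are hypothetical, and exhaustive enumeration is infeasible on large networks, which is why the talk relies on approximation algorithms instead.

```python
from itertools import combinations


def score(edges, S):
    """Slide's score for node set S: (# edges cut) / (# edges inside)."""
    cut = inside = 0
    for u, v in edges:
        u_in, v_in = u in S, v in S
        if u_in and v_in:
            inside += 1
        elif u_in or v_in:
            cut += 1
    return cut / inside if inside else float("inf")


def ncp(nodes, edges, max_k):
    """Return {k: smallest score over all node sets of size k}, k = 1..max_k.

    Brute force: enumerates every subset of each size, so it is only
    usable on very small graphs.
    """
    profile = {}
    for k in range(1, max_k + 1):
        profile[k] = min(score(edges, set(S)) for S in combinations(nodes, k))
    return profile


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
toy_nodes = range(6)
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
toy_profile = ncp(toy_nodes, toy_edges, 3)
```

Plotting `toy_profile` on log-log axes would give the NCP plot for the toy graph; on this example the best size-3 set is one of the two triangles.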

[Figure: d-dimensional meshes; hierarchically nested clusters]

• Zachary’s university karate club social network
• During the study the club split into 2
• The split (squares vs. circles) corresponds to cut B

• Collaborations between scientists in Networks [Newman, 2005]

• Previous work mostly focused on community structure of small networks (~100 nodes)
• We examined 108 different large networks

• Typical example: General relativity collaboration network (4,158 nodes, 13,422 edges)

[Figure: NCP plot; x-axis: k (community size); y-axis: Φ(k) (conductance); annotations: better and better communities; best community has ~100 nodes; communities get worse and worse]

Small clusters on the edge of the network are responsible for the downward part of the NCP plot
[Figure: NCP plot with the best cluster marked]

• Each additional edge inside the cluster costs more:
  ▪ Φ = 1/3 = 0.33
  ▪ Φ = 2/4 = 0.5
  ▪ Φ = 8/6 = 1.3
  ▪ Φ = 64/14 = 4.6
• Each node has twice as many children
[Figure: NCP plot]

Network structure: Core-periphery (jellyfish, octopus)
Whiskers are responsible for good communities
Denser and denser core of the network
Core contains ~60% nodes and ~80% edges

• What is a good model that explains such network structure?
[Figure: NCP plots for three models: Pref. attachment, Small World, Geometric Pref. Attachment; plot shapes labeled: Flat, Down and Flat, Flat and Down]

• Forest Fire [LKF05]: connections spread like a fire
  ▪ New node joins the network
  ▪ Selects a seed node
  ▪ Connects to some of its neighbors
  ▪ Continue recursively
Notes:
  ▪ Preferential attachment flavor: second neighbor is not uniform at random.
  ▪ Copying flavor: since we burn the seed’s neighbors.
  ▪ Hierarchical flavor: seed is parent.
  ▪ “Local” flavor: burn “near” (in a diffusion sense) the seed vertex.
As community grows it blends into the core of the network
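The four steps on this slide can be sketched as a small generator. This is a simplified, undirected illustration and not the exact model of [LKF05]: it assumes a single burning probability `p` applied independently to each neighbor, whereas the slide only says the new node connects to "some" neighbors. The function name `forest_fire_graph` and all parameters are hypothetical.

```python
import random


def forest_fire_graph(n, p, seed=None):
    """Grow an undirected graph on nodes 0..n-1 by a simplified forest fire.

    Assumption (not from the slide): each neighbor of a burned node
    catches fire independently with probability p.
    Returns an adjacency dict {node: set of neighbors}.
    """
    rng = random.Random(seed)
    adj = {0: set()}
    for v in range(1, n):
        # The new node v selects a seed node uniformly at random.
        seed_node = rng.randrange(v)
        burned = {seed_node}
        frontier = [seed_node]
        # The fire spreads recursively from each newly burned node.
        while frontier:
            w = frontier.pop()
            for x in adj[w]:
                if x not in burned and rng.random() < p:
                    burned.add(x)
                    frontier.append(x)
        # v connects to every node the fire reached.
        adj[v] = set()
        for w in burned:
            adj[v].add(w)
            adj[w].add(v)
    return adj


def edge_count(adj):
    """Number of undirected edges in an adjacency dict."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2


sample_graph = forest_fire_graph(50, 0.3, seed=1)
```

With `p = 0` the fire never spreads, so each new node links only to its seed and the result is a tree; with `p = 1` the fire reaches every existing node, so the result is a complete graph. Intermediate values give the local, copying-style growth described in the notes.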

[Figure: rewired network]

• How does the size of the best cluster scale with the size of the network?

• Cluster size remains constant over time (even if one allows nesting)
[Figure: LinkedIn network over time]

• Each dot is a different network

• The Dunbar number
  ▪ 150 individuals is maximum community size
• What edges “mean” and community identification
  ▪ Using node and edge types/attributes
• Implications for machine learning
  ▪ No large clusters
  ▪ No/little (assortative) hierarchical structure
  ▪ Can’t be well embedded: no underlying geometry

Joint work with Eric Horvitz, Microsoft Research

Milgram’s small world experiment
• The Small-world experiment [Milgram ’67, Dodds-Muhamad-Watts ’03]
• People send letters from Nebraska to Boston
• How many steps does it take?
• 6.2 on the average, thus “6 degrees of separation”

• 1) Short paths exist in a social network
• 2) People are able to find them (using only partial knowledge of the network)
Local search: forwarding a message from s to the target t, where d(s,t)=h
  ▪ Good nodes: d=h-1
  ▪ Bad nodes: d≥h
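The good/bad distinction can be made concrete with a breadth-first search from the target. The sketch below assumes an undirected graph stored as an adjacency dict and assumes s can reach t; the names `bfs_distances` and `good_neighbors` and the toy graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, target):
    """Hop distance from every reachable node to target (undirected graph)."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def good_neighbors(adj, s, t):
    """Neighbors of s that are one hop closer to t than s is (d = h - 1).

    Assumes s can reach t; all other neighbors of s are 'bad' (d >= h).
    """
    dist = bfs_distances(adj, t)
    h = dist[s]
    return [w for w in adj[s] if dist.get(w, float("inf")) == h - 1]


# Hypothetical toy graph: a path 0-1-2-3 plus a dead-end branch 0-4-5.
toy_adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0, 5], 5: [4]}
good = good_neighbors(toy_adj, 0, 3)
fraction_good = len(good) / len(toy_adj[0])
```

Here `fraction_good` is the chance that a uniformly random neighbor of s is a good node, which is the baseline a forwarding rule has to beat.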

• Contact (buddy) list
• Messaging window

• We collected the data for June 2006
• 4.5 TB of compressed data:
  ▪ 245 million users logged in
  ▪ 180 million users engaged in conversations
  ▪ 255 billion exchanged messages
  ▪ 1 billion conversations / day

The network: 180M nodes, 1.3B undirected edges

MSN Messenger network: number of steps between pairs of people
• Avg. path length 6.6
• 90% of the people can be reached in < 8 hops

Hops   Nodes
0      1
1      10
2      78
3      3,96
4      8,648
5      3,299,252
6      28,395,849
7      79,059,497
8      52,995,778
9      10,321,008
10     1,955,007
11     518,410
12     149,945
13     44,616
14     13,740
15     4,476
16     1,542
17     536
18     167
19     71
20     29
21     16
22     10
23     3
24     2
25     3

A node that exchanged messages with ~2 million people

Short paths exist and they are robust
[Figure legend: Randomized network (same degree distr.); All links; Both way links]
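The slide does not say how the randomized network with the same degree distribution was built. One common way to get such a baseline is repeated double edge swaps, sketched below as an assumption and not as the procedure used in the talk; the names `rewire` and `degree_counts` are hypothetical.

```python
import random
from collections import Counter


def rewire(edges, n_swaps, seed=None):
    """Degree-preserving randomization by double edge swaps.

    Repeatedly picks two edges (a, b) and (c, d) and replaces them with
    (a, d) and (c, b), skipping swaps that would create a self-loop or a
    duplicate edge. Every node keeps its original degree.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = 0
    attempts = 0
    # Cap the attempts so the loop ends even if few swaps are possible.
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if len({a, b, c, d}) < 4:
            continue  # shared endpoint: swap could create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # swap would duplicate an existing edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


def degree_counts(edges):
    """Map each node to its degree in an undirected edge list."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


# Hypothetical example: a ring of 10 nodes.
ring = [(i, (i + 1) % 10) for i in range(10)]
shuffled = rewire(ring, 5, seed=0)
```

Comparing path lengths on the original edge list and on the rewired one separates what is explained by the degree sequence alone from what depends on the actual wiring.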

• What is the decision function that makes me forward the message to the target?
• What are the characteristics of shortest paths?
• How hard is it to find them?
[Figure: source s and target t with d(s,t)=h; good nodes: d=h-1; bad nodes: d≥h]

Probability of success if we forward to a random neighbor

Use a decision tree to learn a classifier:
  ▪ Model: 0.4128
  ▪ Random: 0.0207

Green bar is the probability that a node is good

• Pick a pair of nodes: start at s
• Walk until hit the target t where next node is chosen:

Search alg.   % found   Mean path length
Random        0.0008    3,709
MinGeoDist    0.0282    778
MaxDeg        0.0158    4,964
Deg/Geo^2     0.1446    2,676
Cntry         0.0108    402
Cntry*Deg     0.1313    3,114
Lang          0.0055    1,699
Lang*Deg      0.0496    3,163
Age           0.0012    2,890
Age*Deg       0.0203    5,324

It works! (in a network with 180 million nodes)
  ▪ Milgram’s path completion is 29%
  ▪ Dodds, Muhamad, Watts: 0.015% completion
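The walk described above can be sketched with a pluggable rule for choosing the next node. The slide does not specify how each strategy picks among neighbors, so this sketch assumes the next node is drawn with probability proportional to a weight; `guided_walk`, `uniform`, `by_degree`, and the step budget `max_steps` are all hypothetical names and parameters.

```python
import random


def uniform(adj, node):
    """Weight for the 'Random' strategy: every neighbor is equally likely."""
    return 1.0


def by_degree(adj, node):
    """Weight for a degree-based strategy: favor well-connected neighbors."""
    return float(len(adj[node]))


def guided_walk(adj, s, t, weight, max_steps, seed=None):
    """Walk from s until t is hit, choosing each next node among the
    current node's neighbors with probability proportional to weight.

    Returns the number of steps taken, or None if t was not reached
    within max_steps or the walk got stuck at a node with no neighbors.
    """
    rng = random.Random(seed)
    current, steps = s, 0
    while current != t and steps < max_steps:
        nbrs = list(adj[current])
        if not nbrs:
            return None
        weights = [weight(adj, w) for w in nbrs]
        current = rng.choices(nbrs, weights=weights, k=1)[0]
        steps += 1
    return steps if current == t else None


# Hypothetical toy graph: an undirected path 0-1-2.
path_adj = {0: [1], 1: [0, 2], 2: [1]}
steps_taken = guided_walk(path_adj, 0, 2, by_degree, max_steps=1000, seed=0)
```

Running many such walks over sampled (s, t) pairs and recording the share that return a step count, along with the mean of those counts, would give quantities analogous to the two columns of the table. Attribute-based strategies such as geography or language would need per-node attributes that this sketch does not model.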

• Why are networks the way they are?
• Only recently have basic properties been observed on a large scale
  ▪ Confirms social science intuitions; calls others into question
• Benefits of working with large data
  ▪ Observe structures not visible at smaller scales
