Jure Leskovec Computer Science Department Cornell University / Stanford University Joint work with: Eric Horvitz, Michael Mahoney,

1 Jure Leskovec (jure@cs.stanford.edu) Computer Science Department Cornell University / Stanford University Joint work with: Eric Horvitz, Michael Mahoney, Kevin Lang, Anirban Dasgupta

2 Large online computing applications have detailed records of human activity. Online communities: Facebook (120 million). Communication: Instant Messenger (~1 billion). News and social media: blogging (250 million). We model the data as a network (an interaction graph), so we can observe and study phenomena at scales not possible before. [Figure: communication network]

3 The small-world experiment: on a 240 million node communication network of Microsoft Instant Messenger. Small vs. large networks: modeling community (cluster) structure of large networks. [Figures: Zachary’s karate club (N=34); tiny part of a large social network]

4 How community-like is a set of nodes? Idea: use approximation algorithms for NP-hard graph partitioning problems as experimental probes of network structure. Conductance (normalized cut): Φ(S) = # edges cut / # edges inside. Small Φ(S) corresponds to more community-like sets of nodes. [Figure labels: S, S’]
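The score on this slide can be computed directly from an edge list. The Python sketch below follows the slide's ratio (edges cut divided by edges inside) rather than the textbook conductance, which divides by the volume of S. The function name `conductance_score` and the toy graph are illustrative assumptions, not part of the talk.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[int, int]


def conductance_score(edges: Iterable[Edge], S: Set[int]) -> float:
    """Slide-style score: (# edges cut) / (# edges inside S).

    Assumes an undirected simple graph given as a list of (u, v) pairs.
    Returns infinity when S contains no internal edge.
    """
    cut = 0
    inside = 0
    for u, v in edges:
        in_u, in_v = u in S, v in S
        if in_u and in_v:
            inside += 1      # both endpoints in S
        elif in_u or in_v:
            cut += 1         # exactly one endpoint in S
    if inside == 0:
        return float("inf")
    return cut / inside


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
triangle_score = conductance_score(toy_edges, {0, 1, 2})  # 1 cut / 3 inside
pair_score = conductance_score(toy_edges, {0, 1})          # 2 cut / 1 inside
```

A whole triangle scores lower (better) than a pair of its nodes, which matches the reading that small Φ(S) means a more community-like set.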

5 Score: Φ(S) = # edges cut / # edges inside. What is the “best” community of 5 nodes?

6 Score: Φ(S) = # edges cut / # edges inside. What is the “best” community of 5 nodes? Bad community: Φ = 5/6 = 0.83.

7 Score: Φ(S) = # edges cut / # edges inside. What is the “best” community of 5 nodes? Bad community: Φ = 5/7 = 0.7. Better community: Φ = 2/5 = 0.4.

8 Score: Φ(S) = # edges cut / # edges inside. What is the “best” community of 5 nodes? Bad community: Φ = 5/7 = 0.7. Better community: Φ = 2/5 = 0.4. Best community: Φ = 2/8 = 0.25.

9 We define the network community profile (NCP) plot: plot the score Φ(k) of the best community of size k. [Figure: NCP plot of log Φ(k) versus community size log k, with marked points Φ(5)=0.25 at k=5 and Φ(7)=0.18 at k=7]
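For a very small graph the NCP can be computed by exhaustive search, which makes the definition concrete. The sketch below is only an illustration: it enumerates every node set of size k, so it is exponential in the graph size, whereas the talk relies on approximation algorithms for large networks. The name `ncp_brute_force` and the toy graph are assumptions.

```python
from itertools import combinations


def ncp_brute_force(nodes, edges, max_k):
    """Return {k: best (lowest) score over all node sets of size k}.

    Score is the slide's ratio: edges cut / edges inside.
    Sets with no internal edge are skipped; if no set of size k
    has an internal edge, the entry is infinity.
    """
    profile = {}
    for k in range(1, max_k + 1):
        best = float("inf")
        for subset in combinations(nodes, k):
            S = set(subset)
            cut = sum(1 for u, v in edges if (u in S) != (v in S))
            inside = sum(1 for u, v in edges if u in S and v in S)
            if inside > 0:
                best = min(best, cut / inside)
        profile[k] = best
    return profile


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
toy_nodes = range(6)
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
profile = ncp_brute_force(toy_nodes, toy_edges, max_k=3)
```

Plotting `profile` on log-log axes gives the NCP plot for the toy graph; the minimum sits at k=3, the size of one triangle.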

10 [Figure panels: d-dimensional meshes; hierarchically nested clusters]

11 Zachary’s university karate club social network. During the study the club split into 2. The split (squares vs. circles) corresponds to cut B.

12 Collaborations between scientists in networks [Newman, 2005]

13 Previous work mostly focused on the community structure of small networks (~100 nodes). We examined 108 different large networks.

14 Typical example: General relativity collaboration network (4,158 nodes, 13,422 edges)

16 [Figure: NCP plot of Φ(k) (conductance) versus k (community size). Annotations: better and better communities; communities get worse and worse; best community has ~100 nodes]

17 Small clusters on the edge of the network are responsible for the downward part of the NCP plot. [Figure labels: NCP plot; best cluster]

18 Each additional edge inside the cluster costs more. Each node has twice as many children: Φ = 1/3 = 0.33, Φ = 2/4 = 0.5, Φ = 8/6 = 1.3, Φ = 64/14 = 4.6. [Figure: NCP plot]

19 Network structure: core-periphery (jellyfish, octopus). Whiskers are responsible for good communities. Denser and denser core of the network. Core contains ~60% of nodes and ~80% of edges.

20 What is a good model that explains such network structure? [Figure panels: Pref. attachment; Small World; Geometric Pref. Attachment. NCP shape labels, in slide order: Flat; Down and Flat; Flat and Down]

21 Forest Fire [LKF05]: connections spread like a fire. A new node joins the network, selects a seed node, connects to some of its neighbors, and continues recursively. As the community grows it blends into the core of the network. Notes: Preferential attachment flavor: second neighbor is not uniform at random. Copying flavor: since we burn the seed’s neighbors. Hierarchical flavor: seed is parent. “Local” flavor: burn “near” (in a diffusion sense) the seed vertex.
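The four steps on this slide can be sketched as a small graph generator. This is a simplified, undirected variant written for illustration: each neighbor of a burning node catches fire with a fixed probability `burn_prob`, which is an assumption and not the parameterization of the published Forest Fire model. The function name and default values are likewise hypothetical.

```python
import random
from collections import deque


def forest_fire_graph(n, burn_prob=0.35, seed=0):
    """Grow an undirected graph with n nodes, one node at a time.

    Simplified sketch: the new node picks a random seed node, the fire
    spreads from it to each unburned neighbor with probability burn_prob,
    and the new node links to every burned node.
    """
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n):
        ambassador = rng.randrange(new)       # select a seed node
        burned = {ambassador}
        frontier = deque([ambassador])
        while frontier:                       # continue recursively
            node = frontier.popleft()
            for nbr in sorted(adj[node]):
                if nbr not in burned and rng.random() < burn_prob:
                    burned.add(nbr)
                    frontier.append(nbr)
        adj[new] = set()
        for b in burned:                      # connect to everything burned
            adj[new].add(b)
            adj[b].add(new)
    return adj


graph = forest_fire_graph(200, burn_prob=0.3, seed=1)
```

Because every new node links at least to its seed, the generated graph is connected; a larger `burn_prob` makes fires spread further and the graph denser.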

22 [Figure label: rewired network]

23 How does the size of the best cluster scale with the size of the network?

24 Cluster size remains constant over time (even if one allows nesting). [Figure: LinkedIn network over time]

25 Each dot is a different network.

26 The Dunbar number: 150 individuals is the maximum community size. What edges “mean” and community identification: using node and edge types/attributes. Implications for machine learning: no large clusters; no/little (assortative) hierarchical structure; can’t be well embedded, no underlying geometry.

27 Joint work with Eric Horvitz, Microsoft Research

28 Milgram’s small world experiment. The small-world experiment [Milgram ’67, Dodds-Muhamad-Watts ’03]: people send letters from Nebraska to Boston. How many steps does it take? 6.2 on average, thus “6 degrees of separation”.

29 1) Short paths exist in a social network. 2) People are able to find them (using only partial knowledge of the network). Local search: forwarding a message from s toward the target t, where d(s,t) = h. Good nodes are neighbors of s at distance d = h-1 from t; bad nodes are neighbors at distance d ≥ h.
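The good/bad distinction can be checked with one breadth-first search from the target. The sketch below assumes an undirected, connected graph stored as an adjacency dictionary; the helper names and the toy graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def classify_neighbors(adj, s, t):
    """Split the neighbors of s into good (d = h-1) and bad (d >= h).

    Assumes t is reachable from s; h is the hop distance d(s, t).
    """
    dist_to_t = bfs_distances(adj, t)
    h = dist_to_t[s]
    inf = float("inf")
    good = [v for v in adj[s] if dist_to_t.get(v, inf) == h - 1]
    bad = [v for v in adj[s] if dist_to_t.get(v, inf) >= h]
    return h, good, bad


# Hypothetical toy graph: path 0-1-3-4 with a dead-end node 2 attached to 0.
toy = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3]}
h, good, bad = classify_neighbors(toy, s=0, t=4)
share_good = len(good) / len(toy[0])
```

Here `share_good` is the chance that forwarding to a uniformly random neighbor of s moves the message one hop closer to t.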

30 [Screenshots: contact (buddy) list; messaging window]

31 We collected the data for June 2006: 4.5 TB of compressed data. 245 million users logged in; 180 million users engaged in conversations; 255 billion exchanged messages; 1 billion conversations / day.

32 The network: 180M nodes, 1.3B undirected edges

33 MSN Messenger network: number of steps between pairs of people. Avg. path length 6.6. 90% of the people can be reached in < 8 hops.
Hops    Nodes
0       1
1       10
2       78
3       3,96
4       8,648
5       3,299,252
6       28,395,849
7       79,059,497
8       52,995,778
9       10,321,008
10      1,955,007
11      518,410
12      149,945
13      44,616
14      13,740
15      4,476
16      1,542
17      536
18      167
19      71
20      29
21      16
22      10
23      3
24      2
25      3
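A hops/nodes table of this kind is what a breadth-first search from a single start node produces. The sketch below builds such a histogram on a small hypothetical graph and derives the mean hop count from it; it does not reproduce the Messenger numbers, and the function names are assumptions.

```python
from collections import Counter, deque


def hop_histogram(adj, source):
    """Count how many nodes sit at each hop distance from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return Counter(dist.values())


def mean_hops(hist):
    """Average hop distance to the reached nodes, excluding the source."""
    reached = sum(count for hops, count in hist.items() if hops > 0)
    total = sum(hops * count for hops, count in hist.items())
    return total / reached


# Hypothetical toy graph: a 4-cycle 0-1-3-2-0 with a tail node 4 attached to 3.
toy = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
hist = hop_histogram(toy, source=0)
avg = mean_hops(hist)
```

An average path length for a whole network would come from averaging such searches over many start nodes rather than from a single one.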

34 A node that exchanged messages with ~2 million people

35 Short paths exist and they are robust. [Figure curves: randomized network (same degree distr.); all links; both-way links]

36 What is the decision function that makes me forward the message to the target? What are the characteristics of shortest paths? How hard is it to find them? [Figure: source s and target t with d(s,t)=h; good nodes: d=h-1; bad nodes: d≥h]
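One common way to build a randomized network with the same degree distribution is repeated double-edge swaps. The slide does not say which procedure was used, so the sketch below is an assumption chosen for illustration; the function names and the toy graph are hypothetical.

```python
import random
from collections import Counter


def degree_preserving_rewire(edges, n_swaps, seed=0):
    """Randomize an undirected simple graph while keeping every node's degree.

    Repeatedly picks two edges (a, b) and (c, d) and replaces them with
    (a, d) and (c, b), skipping swaps that would create a self-loop or
    a duplicate edge.
    """
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c                      # try both orientations
        if len({a, b, c, d}) < 4:
            continue                         # would create a self-loop
        e1, e2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if e1 in edge_set or e2 in edge_set:
            continue                         # would create a duplicate edge
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {e1, e2}
        edge_list[i], edge_list[j] = e1, e2
        done += 1
    return edge_list


def degrees(edge_list):
    """Degree of every node in an edge list."""
    deg = Counter()
    for u, v in edge_list:
        deg[u] += 1
        deg[v] += 1
    return deg


# Hypothetical toy graph: a ring of 10 nodes plus two chords.
ring = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5), (2, 7)]
rewired = degree_preserving_rewire(ring, n_swaps=20, seed=3)
```

Comparing path lengths on `ring` and `rewired` is the toy analogue of comparing the real network with its randomized counterpart.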

40 Probability of success if we forward to a random neighbor

42 Use a decision tree to learn a classifier. Model: 0.4128. Random: 0.0207.

43 Green bar is the probability that a node is good.

44 Pick a pair of nodes and start at s. Walk until we hit the target t, where the next node is chosen by one of the following search algorithms:
Search alg.    % found    Mean path length
Random         0.0008     3,709
MinGeoDist     0.0282     778
MaxDeg         0.0158     4,964
Deg/Geo^2      0.1446     2,676
Cntry          0.0108     402
Cntry*Deg      0.1313     3,114
Lang           0.0055     1,699
Lang*Deg       0.0496     3,163
Age            0.0012     2,890
Age*Deg        0.0203     5,324
It works! (in a network with 180 million nodes)
Milgram’s path completion is 29%. Dodds, Muhamad, Watts: 0.015% comp
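The walk described on this slide can be sketched with a pluggable scoring function, so that Random and MaxDeg become two choices of `score`. The details below are assumptions made for illustration: the walk prefers unvisited neighbors, steps straight to t when t is adjacent, and gives up after `max_steps`. The slide does not specify these rules, and the names and toy graph are hypothetical.

```python
import random


def guided_walk(adj, s, t, score, max_steps=1000, seed=0):
    """Walk from s until t is hit; return the number of steps, or None.

    At each step the next node is the highest-scoring neighbor, with
    ties broken at random. Assumes every node has at least one neighbor.
    """
    rng = random.Random(seed)
    current, visited, steps = s, {s}, 0
    while current != t and steps < max_steps:
        nbrs = adj[current]
        if t in nbrs:
            nxt = t                              # target is adjacent
        else:
            fresh = [v for v in nbrs if v not in visited]
            pool = fresh if fresh else list(nbrs)
            top = max(score(v) for v in pool)
            nxt = rng.choice([v for v in pool if score(v) == top])
        visited.add(nxt)
        current = nxt
        steps += 1
    return steps if current == t else None


# Hypothetical toy graph: path 0-1-3-4 with a dead-end node 2 attached to 1.
toy = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}
max_deg_steps = guided_walk(toy, 0, 4, score=lambda v: len(toy[v]))
random_steps = guided_walk(toy, 0, 4, score=lambda v: 0)
```

A constant score gives the Random strategy, while scoring by `len(adj[v])` gives MaxDeg; strategies such as Cntry or Age would need node attributes that this sketch does not model.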

45 Why are networks the way they are? Only recently have basic properties been observed on a large scale: confirms social science intuitions; calls others into question. Benefits of working with large data: observe structures not visible at smaller scales.

