
Slide 1: Inferring Networks of Diffusion and Influence
Manuel Gomez Rodriguez (1,2), Jure Leskovec (1), Andreas Krause (3)
(1) Stanford University, (2) MPI for Biological Cybernetics, (3) California Institute of Technology

Slide 2: Hidden and Implicit Networks
- Many social or information networks are implicit or hard to observe:
  - Hidden/hard-to-reach populations: e.g., the network of needle sharing among drug-injection users
  - Implicit connections: e.g., the network of information propagation in online news media
- But we can observe the results of the processes taking place on such (invisible) networks:
  - Virus propagation: drug users get sick, and we observe when they see the doctor
  - Information networks: we observe when media sites mention information
- Question: can we infer the hidden networks?

Slide 3: Inferring the Network
- There is a directed social network over which diffusions take place, but we do not observe the edges of the network.
- We only see the time when each node gets infected:
  - Cascade c1: (a, 1), (c, 2), (b, 6), (e, 9)
  - Cascade c2: (c, 1), (a, 4), (b, 5), (d, 8)
- Task: infer the underlying network.
[Figure: hidden network over nodes a-e, with the two cascades overlaid]

Slide 4: Examples and Applications
- Virus propagation: viruses propagate through the network. We observe when people get sick, but NOT who infected them.
- Word of mouth & viral marketing: recommendations and influence propagate. We observe when people buy products, but NOT who influenced them.
- In both cases the process is observable but the network is hidden: can we infer the underlying network?

Slide 5: Our Problem Formulation
- Plan for the talk:
  1. Define a continuous-time model of diffusion
  2. Define the likelihood of the observed cascades given a network
  3. Show how to efficiently compute the likelihood of cascades
  4. Show how to efficiently find a graph G that maximizes the likelihood
- Note: there is a super-exponential number of candidate graphs, O(N^(N*N)), yet our method finds a near-optimal graph in O(N^2)!

Slide 6: Cascade Diffusion Model
- Continuous-time cascade diffusion model:
  - Cascade c reaches node u at time t_u and spreads to u's neighbors:
    - with probability β the cascade propagates along edge (u, v)
    - the infection time of node v is then t_v = t_u + Δ, e.g. Δ ~ Exponential or Power-law
- We assume each node v has only one parent!
[Figure: cascade spreading from node a with transmission delays Δ1, Δ2, Δ3, Δ4 and infection times t_a, t_b, t_c, t_e, t_f]
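The generative model on this slide can be sketched in a few lines. This is a minimal simulation, with one assumption the slide leaves implicit: each edge makes a single transmission attempt, so each node's earliest infection time is a shortest path over the edges that fired (computed Dijkstra-style), which also gives every node a single parent.

```python
import heapq
import random

def simulate_cascade(graph, source, beta=0.5, mean_delta=1.0, seed=0):
    """Simulate the continuous-time cascade model of the slide.

    graph: dict u -> list of neighbors v (directed edges).
    Each edge (u, v) fires independently with probability beta; if it
    fires, it carries a delay Delta ~ Exponential(mean = mean_delta).
    Returns [(node, infection_time), ...] sorted by time.
    """
    rng = random.Random(seed)
    # Draw one transmission outcome per edge.
    delay = {}
    for u, nbrs in graph.items():
        for v in nbrs:
            if rng.random() < beta:
                delay[(u, v)] = rng.expovariate(1.0 / mean_delta)
    # Earliest infection times = shortest paths over the fired edges.
    t = {source: 0.0}
    heap = [(0.0, source)]
    done = set()
    while heap:
        tu, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v in graph.get(u, []):
            if (u, v) in delay:
                tv = tu + delay[(u, v)]
                if v not in t or tv < t[v]:
                    t[v] = tv
                    heapq.heappush(heap, (tv, v))
    return sorted(t.items(), key=lambda kv: kv[1])
```

Running this repeatedly over a known graph produces exactly the kind of (node, time) cascades that the inference problem on Slide 3 takes as input.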

Slide 7: Likelihood of a Single Cascade
- Probability that cascade c propagates from node u to node v: P_c(u, v) ∝ P(t_v - t_u), with t_v > t_u
- Probability that cascade c propagates in a tree pattern T: the product of P_c(u, v) over the edges (u, v) of T
- Since not all nodes get infected by the diffusion process, we introduce an external influence node m with P_c(m, v) = ε
[Figure: tree pattern T on cascade c: (a, 1), (b, 2), (c, 4), (e, 8), with ε-edges from the external node m]
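In log space the tree probability is a sum over edges. A small sketch, assuming (as one concrete choice, not fixed by the slide) an exponential transmission model P(t_v - t_u) = λ·exp(-λ·(t_v - t_u)):

```python
import math

def tree_log_likelihood(tree, times, lam=1.0, eps=1e-6):
    """Log-likelihood of one propagation tree T for a cascade c.

    tree:  dict child -> parent; parent None denotes the external node m.
    times: dict node -> infection time t_v.
    An edge from the external node m contributes probability eps;
    a real edge (u, v) contributes lam * exp(-lam * (t_v - t_u)).
    """
    ll = 0.0
    for v, u in tree.items():
        if u is None:                      # external influence m -> v
            ll += math.log(eps)
        else:
            delta = times[v] - times[u]
            assert delta > 0, "edges must point forward in time"
            ll += math.log(lam) - lam * delta
    return ll
```

Because ε is tiny, a tree that explains a node by a plausible earlier parent scores far higher than one that blames the external node.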

Slide 8: Finding the Diffusion Network
- There are many possible propagation trees that are consistent with the observed data, e.g. for c: (a, 1), (c, 2), (b, 3), (e, 4)
- We need to consider all possible propagation trees T supported by the graph G; the likelihood of a set of cascades C is the product of P(c|G) over the cascades c in C
- We want to find the graph G that maximizes this likelihood
- Good news: computing P(c|G) is tractable. Even though there are O(n^n) possible propagation trees, the Matrix-Tree Theorem can compute the sum in O(n^3)!
- Bad news: we actually want to search over graphs, and there is a super-exponential number of graphs!
[Figure: three different propagation trees consistent with cascade c]
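The "good news" step can be illustrated concretely. The directed Matrix-Tree Theorem says the sum, over all spanning arborescences rooted at r, of the product of edge weights equals a determinant of a reduced Laplacian. The sketch below is my own rendering of that computation (the `det` helper and the in-degree Laplacian convention are assumptions of this sketch, not material from the talk):

```python
def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n, d = len(m), 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        if abs(m[p][i]) < 1e-12:
            return 0.0
        if p != i:
            m[i], m[p] = m[p], m[i]
            d = -d
        d *= m[i][i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n):
                m[r][c] -= f * m[i][c]
    return d

def weighted_arborescence_sum(nodes, w, root):
    """Sum over all spanning arborescences rooted at `root` of the
    product of edge weights (directed Matrix-Tree Theorem).

    w: dict (u, v) -> weight of directed edge u -> v.
    Builds the in-degree Laplacian and drops the root's row/column.
    """
    idx = {v: i for i, v in enumerate(n for n in nodes if n != root)}
    n = len(idx)
    L = [[0.0] * n for _ in range(n)]
    for (u, v), weight in w.items():
        if v == root:
            continue
        j = idx[v]
        L[j][j] += weight            # total in-weight of v on the diagonal
        if u != root:
            L[idx[u]][j] -= weight   # off-diagonal entry -w(u, v)
    return det(L)
```

On a three-node example with edges a→b (weight 2), a→c (3), b→c (4), the two arborescences rooted at a have products 2·3 and 2·4, and the determinant indeed returns 14: this is exactly why P(c|G), a sum over all propagation trees, is computable in O(n^3).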

Slide 9: An Alternative Formulation
- We consider only the most likely tree: the maximum log-likelihood of a cascade c under a graph G
- The log-likelihood of G given a set of cascades C is the sum of these per-cascade maxima
- The problem is still intractable (NP-hard), but we present an algorithm that finds near-optimal networks in O(N^2)

Slide 10: Max Directed Spanning Tree
- Given a cascade c, what is the most likely propagation tree?
- It is a maximum directed spanning tree (MDST) of the sub-graph of G induced by the nodes in cascade c
- That sub-graph is a DAG, because edges point forward in time
- So greedy parent selection gives the globally optimal tree: for each node, just pick its maximum-weight in-edge
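Because the time-respecting sub-graph is a DAG, the per-node choices are independent and the greedy pick is globally optimal. A minimal sketch (the `eps_log` fallback stands in for log ε of the external node m, a value I am assuming for illustration):

```python
def most_likely_tree(cascade, edge_logw, eps_log=-13.8):
    """Most likely propagation tree for one cascade.

    cascade:   list of (node, time) pairs, sorted by infection time.
    edge_logw: dict (u, v) -> log-weight log P_c(u, v) of edge u -> v.
    Each node independently picks its max-log-weight in-edge among
    earlier-infected nodes, falling back to the external node m
    (parent None, contribution eps_log). Returns (parent_map, total
    log-likelihood of the tree).
    """
    parent, total, seen = {}, 0.0, []
    for v, t_v in cascade:
        best_u, best_w = None, eps_log       # default: external node m
        for u, _t_u in seen:                 # only earlier nodes qualify
            w = edge_logw.get((u, v))
            if w is not None and w > best_w:
                best_u, best_w = u, w
        parent[v] = best_u
        total += best_w
        seen.append((v, t_v))
    return parent, total
```

Each node scans only the nodes infected before it, so one cascade costs O(n^2) in the cascade length, with no need for a general MDST algorithm.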

Slide 11: Objective Function is Submodular
- Theorem: the log-likelihood F_c(G) of cascade c is monotonic and submodular in the edges of the graph G. For A ⊆ B ⊆ V×V: F_c(A ∪ {e}) - F_c(A) ≥ F_c(B ∪ {e}) - F_c(B), i.e. the gain of adding an edge to a "small" graph is at least the gain of adding it to a "large" graph.
- Proof sketch: consider a single cascade c and an edge e into node s with weight x. Let w be the max-weight in-edge of s in A, and w' the max-weight in-edge of s in B. Since A ⊆ B, we know w ≤ w'. Then F_c(A ∪ {e}) - F_c(A) = max(w, x) - w ≥ max(w', x) - w' = F_c(B ∪ {e}) - F_c(B).
- Hence the total log-likelihood F_C(G) over all cascades is monotonic and submodular too.
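The max(w, x) - w argument can be checked mechanically. Below is a small sketch of the MDST-style objective (sum over nodes of the best in-edge weight present) together with a brute-force diminishing-returns checker; both functions are illustrations of mine, not code from the talk:

```python
import itertools

def F(edge_set, weights, nodes, floor=0.0):
    """MDST-style objective: for each node, the max weight among its
    in-edges present in edge_set (floor = external-node contribution)."""
    total = 0.0
    for v in nodes:
        ws = [weights[e] for e in edge_set if e[1] == v]
        total += max(ws) if ws else floor
    return total

def check_submodular(weights, nodes):
    """Exhaustively verify F(A+{e}) - F(A) >= F(B+{e}) - F(B)
    for every A subset of B and every edge e not in B."""
    edges = list(weights)
    for r in range(len(edges) + 1):
        for B in itertools.combinations(edges, r):
            B = set(B)
            for k in range(len(B) + 1):
                for A in itertools.combinations(sorted(B), k):
                    A = set(A)
                    for e in edges:
                        if e in B:
                            continue
                        gain_A = F(A | {e}, weights, nodes) - F(A, weights, nodes)
                        gain_B = F(B | {e}, weights, nodes) - F(B, weights, nodes)
                        if gain_A < gain_B - 1e-12:
                            return False
    return True
```

Per node the gain of a new in-edge of weight x is max(w, x) - w, which can only shrink as the current best w grows: exactly the inequality in the proof sketch.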

Slide 12: Finding the Diffusion Graph
- Use greedy hill-climbing to maximize F_C(G): for i = 1...k, at every step pick the edge that maximizes the marginal improvement
- Benefits:
  1. Approximation guarantee (≈ 0.63 of OPT)
  2. Tight online bounds on the solution quality
  3. Speed-ups: lazy evaluation (by submodularity) and localized updates (by the structure of the problem)
[Figure: marginal gains of candidate edges on a small graph, updated locally as edges are added]
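The lazy-evaluation speed-up works because submodularity makes any cached gain an upper bound on the current gain, so only the top of a priority queue ever needs re-evaluation. A generic sketch (the objective F is passed in as a black box; this is my illustration of the technique, not the talk's NETINF code):

```python
import heapq

def lazy_greedy(candidate_edges, F, k):
    """Greedy hill-climbing with lazy evaluation.

    F: monotone submodular set function over edge sets.
    k: number of edges to pick. Returns a set of edges whose value
    is within (1 - 1/e) of the best k-edge solution.
    """
    G = set()
    base = F(G)
    # Heap entries: (-cached_gain, edge, round_when_cached).
    heap = [(-(F({e}) - base), e, 0) for e in candidate_edges]
    heapq.heapify(heap)
    for step in range(1, k + 1):
        while heap:
            neg_gain, e, stamp = heapq.heappop(heap)
            if stamp == step:             # gain is fresh: e is the argmax
                G.add(e)
                base = F(G)
                break
            gain = F(G | {e}) - base      # stale: re-evaluate lazily
            heapq.heappush(heap, (-gain, e, step))
    return G
```

In practice most edges' cached gains never reach the top of the heap, which (together with the localized updates) is the source of the two-orders-of-magnitude speed-up reported on Slide 18.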

Slide 13: Experimental Setup
- We validate our method on:
  - Synthetic data: generate a graph G on k edges, generate cascades, record node infection times, reconstruct G
  - Real data: MemeTracker, 172M news articles (Aug '08 - Sept '09), 343M textual phrases (quotes)
- Questions:
  - How many edges of G can we find? (precision-recall, break-even point)
  - How many cascades do we need?
  - How fast is the algorithm?
  - How well do we optimize the likelihood F_C(G)?

Slide 14: Small Synthetic Example
- Baseline: pick the k strongest edges
[Figure: true network vs. baseline network vs. our method on a small synthetic network]

Slide 15: Synthetic Networks
- Performance does not depend on the network structure:
  - Synthetic networks: Forest Fire, Kronecker, etc.
  - Transmission time distributions: Exponential, Power-law
- Break-even point of > 90%
[Figures: 1024-node hierarchical Kronecker network with exponential transmission model; 1000-node Forest Fire network (α = 1.1) with power-law transmission model]

Slide 16: How Good Is Our Graph?
- We achieve ≈ 90% of the best possible network!

Slide 17: How Many Cascades Do We Need?
- With 2x as many infections as edges, the break-even point is already 0.8 - 0.9!

Slide 18: Running Time
- Lazy evaluation and localized updates give a speed-up of two orders of magnitude!
- We can infer a network of 10k nodes in several hours

Slide 19: Real Data: Information Diffusion
- MemeTracker dataset: 172M news articles, Aug '08 - Sept '09, 343M textual phrases (quotes); http://memetracker.org
- We infer the network of information diffusion: who tends to copy (repeat after) whom

Slide 20: Real Network
- We use the hyperlinks between sites to generate the edges of a ground-truth network G
- From the MemeTracker dataset we have the timestamps of:
  1. Cascades of hyperlinks: the time when a site creates a link
  2. Cascades of (MemeTracker) textual phrases: the time when a site mentions the information
- Can we infer the hyperlink network from the times when sites created links? From the times when sites mentioned information?

Slide 21: Real Network: Performance
- Break-even points of 50% for hyperlink cascades and 30% for MemeTracker cascades!
[Figures: 500-node hyperlink network inferred from hyperlink cascades and from MemeTracker cascades]

22 22  5,000 news sites: Blogs Mainstream media Diffusion Network

Slide 23: Diffusion Network (small part)
[Figure: a small part of the inferred network, blogs vs. mainstream media]

Slide 24: Networks and Processes
- We infer hidden networks based on diffusion data (timestamps)
- Problem formulated in a maximum likelihood framework; NP-hard to solve exactly
- We develop an approximation algorithm that:
  - is efficient: it runs in O(N^2)
  - is invariant to the structure of the underlying network
  - returns a network with a tight bound on its quality
- Future work:
  - Learn both the network and the diffusion model
  - Applications to other domains: biology, neuroscience, etc.

Slide 25: Thanks!
For more (code & data): http://snap.stanford.edu/netinf

