# Purnamrita Sarkar (Carnegie Mellon) Deepayan Chakrabarti (Yahoo! Research) Andrew W. Moore (Google, Inc.)

## Presentation on theme: "Purnamrita Sarkar (Carnegie Mellon) Deepayan Chakrabarti (Yahoo! Research) Andrew W. Moore (Google, Inc.)"— Presentation transcript:


- Which pair of nodes {i, j} should be connected?
- Variant: node i is given.
- Examples: friend suggestion in Facebook (Alice, Bob, Charlie); movie recommendation in Netflix.

- Predict a link between nodes:
  - with the minimum number of hops
  - with the max number of common neighbors (length-2 paths)
- Example (Alice, Bob, Charlie): a common friend with 1000 followers is prolific, so it gives less evidence; a common friend with 8 followers is less prolific, so it gives much more evidence.
- The Adamic/Adar score gives more weight to low-degree common neighbors.
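As a minimal sketch of the two scores named on this slide, the snippet below counts common neighbors and computes Adamic/Adar in its usual form, the sum of 1/log(degree) over common neighbors. The toy graph, the node names (`hub`, `dave`, `e1`, ...), and the helper names are illustrative assumptions, not part of the talk.

```python
import math
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def common_neighbors(adj, i, j):
    """Nodes linked to both i and j (the length-2 paths between them)."""
    return adj[i] & adj[j]


def adamic_adar(adj, i, j):
    # Each common neighbor z contributes 1 / log(degree(z)). A common neighbor
    # is linked to both i and j, so degree(z) >= 2 and the log is positive.
    return sum(1.0 / math.log(len(adj[z])) for z in common_neighbors(adj, i, j))


# Hypothetical toy graph: "bob" has degree 2, "hub" has degree 5.
edges = [
    ("alice", "bob"), ("bob", "charlie"),
    ("alice", "hub"), ("dave", "hub"),
    ("hub", "e1"), ("hub", "e2"), ("hub", "e3"),
]
adj = build_adjacency(edges)
```

In this toy graph both (alice, charlie) and (alice, dave) have exactly one common neighbor, but Adamic/Adar ranks (alice, charlie) higher because their shared neighbor has the lower degree.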

- Predict a link between nodes:
  - with the minimum number of hops
  - with more common neighbors (length-2 paths)
  - with a larger Adamic/Adar score
  - with more short paths (e.g. length-3 paths)
  - …

[Figure: link prediction accuracy* of Random, Shortest Path, Common Neighbors, Adamic/Adar, and Ensemble of short paths.]

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

How do we justify these observations? Especially if the graph is sparse?

Raftery et al.'s model:
- Unit-volume universe; nodes are uniformly distributed in a latent space.
- Points close in this space are more likely to be connected.
- The problem of link prediction is to find the nearest neighbor that is not currently linked to the node.
  → Equivalent to inferring distances in the latent space.

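Two sources of randomness:
- Point positions: uniform in D-dimensional space.
- Linkage probability: logistic with parameters α, r.
- α, r, and D are known.

[Figure: linkage probability versus distance, with axis ticks at 1 and ½ and the radius r marked; closer points have a higher probability of linking, and α determines the steepness.]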
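The two sources of randomness can be sketched as a small sampler. The slide names only a logistic with parameters α and r; the specific form 1/(1 + exp(α(d − r))) used below is an assumption, chosen because it equals ½ at d = r and becomes steeper as α grows. Using the unit cube as the unit-volume universe (boundary effects included) and the function names are also assumptions.

```python
import math
import random


def link_probability(d, alpha, r):
    """Assumed logistic link function: 1/2 at d == r, steeper for larger alpha."""
    # Clamp the exponent so very large alpha does not overflow math.exp.
    x = min(alpha * (d - r), 700.0)
    return 1.0 / (1.0 + math.exp(x))


def sample_graph(n, dim, alpha, r, seed=0):
    """Sample n points uniformly in [0, 1]^dim, then link each pair independently."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])  # Euclidean distance
            if rng.random() < link_probability(d, alpha, r):
                edges.add((i, j))
    return points, edges
```

For example, `sample_graph(200, 2, alpha=50.0, r=0.1)` would give a sparse graph in which most links join points closer than about 0.1.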


- $\Pr_2(i,j) = \Pr(\text{common neighbor} \mid d_{ij})$
- Product of two logistic probabilities, integrated over a volume determined by $d_{ij}$.
- As $\alpha \to \infty$, the logistic → step function. Much easier to analyze!
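In the step-function limit, a third node is a common neighbor exactly when it lies within distance r of both i and j, so the common-neighbor probability becomes the volume of the intersection of two balls. The sketch below checks this in D = 2, where the intersection is a lens with a closed-form area, against a Monte Carlo estimate. It assumes a unit-volume universe large enough to ignore boundary effects; the function names are illustrative.

```python
import math
import random


def lens_area(r, d):
    """Area of the intersection of two radius-r disks whose centers are d apart."""
    if d >= 2.0 * r:
        return 0.0
    chord_sq = max(0.0, 4.0 * r * r - d * d)  # guard against rounding below zero
    angle = math.acos(min(1.0, d / (2.0 * r)))
    return 2.0 * r * r * angle - 0.5 * d * math.sqrt(chord_sq)


def common_neighbor_probability_mc(r, d, samples=200_000, seed=0):
    """Monte Carlo estimate of the same area: i sits at (0, 0), j at (d, 0)."""
    rng = random.Random(seed)
    x_lo, x_hi = -r, d + r
    y_lo, y_hi = -r, r
    box_area = (x_hi - x_lo) * (y_hi - y_lo)
    hits = 0
    for _ in range(samples):
        x = rng.uniform(x_lo, x_hi)
        y = rng.uniform(y_lo, y_hi)
        near_i = x * x + y * y <= r * r
        near_j = (x - d) ** 2 + y * y <= r * r
        hits += near_i and near_j
    return box_area * hits / samples
```

The area shrinks monotonically as d grows, which is why the number of common neighbors carries information about the latent distance.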

- Everyone has the same radius r.
- Unit-volume universe.
- V(r) = volume of a ball of radius r in D dimensions.
- η = number of common neighbors.
- Empirical Bernstein bounds on distance.

- OPT = node closest to i
- MAX = node with max common neighbors with i
- Theorem: w.h.p., $d_{OPT} \le d_{MAX} \le d_{OPT} + 2\,[\epsilon / V(1)]^{1/D}$, where $\epsilon = c_1 (\mathrm{var}_N / N)^{1/2} + c_2 / (N - 1)$ and D = dimensionality.
- Common neighbors is an asymptotically optimal heuristic as N → ∞.
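To see how the gap term in the theorem behaves, the sketch below evaluates 2[ε/V(1)]^{1/D} using the standard volume of a D-dimensional ball. The slide does not give the constants c_1 and c_2 or a value for var_N, so the defaults here are placeholders chosen only for illustration.

```python
import math


def ball_volume(r, dim):
    """Volume of a radius-r ball in dim dimensions: pi^(D/2) r^D / Gamma(D/2 + 1)."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * r ** dim


def epsilon(n, var_n, c1, c2):
    """epsilon = c1 * sqrt(var_N / N) + c2 / (N - 1), as written on the slide."""
    return c1 * math.sqrt(var_n / n) + c2 / (n - 1)


def distance_gap(n, var_n, dim, c1=1.0, c2=1.0):
    """Upper bound on d_MAX - d_OPT; c1 and c2 are placeholder constants."""
    return 2.0 * (epsilon(n, var_n, c1, c2) / ball_volume(1.0, dim)) ** (1.0 / dim)
```

With var_N held fixed, ε shrinks like N^(-1/2), so the gap shrinks like N^(-1/(2D)): it vanishes as N grows, but more slowly in higher dimensions.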

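- Node k has radius $r_k$.
- $i \to k$ if $d_{ik} \le r_k$ (directed graph).
- $r_k$ captures the popularity of node k.
- Two types of common neighbor k between i and j:
  - Type 1: the links are determined by the radii $r_i$ and $r_j$; k must lie in the region $A(r_i, r_j, d_{ij})$.
  - Type 2: both links are determined by the radius $r_k$; k must lie in the region $A(r_k, r_k, d_{ij})$.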

Example graph:
- $N_1$ nodes of radius $r_1$ and $N_2$ nodes of radius $r_2$, with $r_1 \ll r_2$.
- $\eta_1 \sim \mathrm{Bin}[N_1, A(r_1, r_1, d_{ij})]$ and $\eta_2 \sim \mathrm{Bin}[N_2, A(r_2, r_2, d_{ij})]$.
- Maximize $\Pr[\eta_1, \eta_2 \mid d_{ij}]$ = product of two binomials. The maximizer $d^*$ satisfies
  $w(r_1)\,E[\eta_1 \mid d^*] + w(r_2)\,E[\eta_2 \mid d^*] = w(r_1)\,\eta_1 + w(r_2)\,\eta_2$
- RHS ↑ → LHS ↑ → $d^*$ ↓
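The maximization on this slide can be sketched numerically. The snippet below replaces the slide's closed-form weighted condition with a brute-force grid search over d, and it assumes D = 2 so that A(r, r, d) can be taken as the lens area of two radius-r disks; both choices, and all names and numbers, are assumptions for illustration.

```python
import math


def lens_area(r, d):
    """Assumed A(r, r, d) in 2-D: intersection area of two radius-r disks."""
    if d >= 2.0 * r:
        return 0.0
    chord_sq = max(0.0, 4.0 * r * r - d * d)
    angle = math.acos(min(1.0, d / (2.0 * r)))
    return 2.0 * r * r * angle - 0.5 * d * math.sqrt(chord_sq)


def log_binomial_term(eta, n, p):
    """Log-likelihood of eta successes in n trials, dropping the constant binomial coefficient."""
    if p <= 0.0:
        return 0.0 if eta == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if eta == n else -math.inf
    return eta * math.log(p) + (n - eta) * math.log1p(-p)


def estimate_distance(etas, counts, radii, grid_size=2000):
    """Grid-search the d that maximizes the product of binomial likelihoods."""
    d_max = 2.0 * max(radii)
    grid = [d_max * (k + 1) / grid_size for k in range(grid_size)]

    def log_likelihood(d):
        return sum(
            log_binomial_term(eta, n, lens_area(r, d))
            for eta, n, r in zip(etas, counts, radii)
        )

    return max(grid, key=log_likelihood)
```

Calling `estimate_distance((10, 30), (2000, 200), (0.05, 0.2))` yields a smaller distance than `estimate_distance((2, 10), (2000, 200), (0.05, 0.2))`, in line with the slide's last bullet: more common neighbors push the estimate d* down.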

[Figure: plot over the radius r, annotated with "Variance" and "Jacobian".]
- Small variance → presence is more surprising.
- r close to the max radius: small variance → absence is more surprising.
- Adamic/Adar, 1/r: real-world graphs generally fall in this range.


- Common neighbors = 2-hop paths.
- Analysis of longer paths has two components:
  1. Bounding $E(\eta_l \mid d_{ij})$, where $\eta_l$ = number of $l$-hop paths. → Bounds $\Pr_l(i,j)$ by using the triangle inequality on a series of common-neighbor probabilities (triangulation).
  2. $\eta_l \approx E(\eta_l \mid d_{ij})$: bounded dependence of $\eta_l$ on the position of each node → can use McDiarmid's inequality to bound $|\eta_l - E(\eta_l \mid d_{ij})|$.
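For reference, the general form of McDiarmid's inequality is stated below. Reading $f$ as $\eta_l$ and the $X_k$ as the independent node positions is how the slide appears to use it; the constants $c_k$ are generic symbols, not values from the talk.

```latex
% X_1, ..., X_n independent; f has bounded differences:
\[
  \sup_{x_1,\dots,x_n,\,x_k'}
  \bigl| f(x_1,\dots,x_k,\dots,x_n) - f(x_1,\dots,x_k',\dots,x_n) \bigr|
  \le c_k \quad \text{for each } k .
\]
% Then, for every t > 0:
\[
  \Pr\bigl( \lvert f(X_1,\dots,X_n) - E[f(X_1,\dots,X_n)] \rvert \ge t \bigr)
  \le 2 \exp\!\left( - \frac{2 t^2}{\sum_{k=1}^{n} c_k^2} \right).
\]
```

The smaller the influence $c_k$ of any single node's position on the path count, the tighter the concentration of $\eta_l$ around its conditional mean.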

- Bound $d_{ij}$ as a function of $\eta_l$ using McDiarmid's inequality.
- For $l' \ge l$ we need $\eta_{l'} \gg \eta_l$ to obtain similar bounds.
- Also, we can obtain much tighter bounds for long paths if shorter paths are known to exist.

[Figure: logistic linkage probability, with axis ticks at 1 and ½.]
- Factor of ¼: weak bound for the logistic.
- Can be made tighter as the logistic approaches the step function.

Three key ingredients:
1. Closer points are likelier to be linked. (Small-world model: Watts & Strogatz, 1998; Kleinberg, 2001)
2. The triangle inequality holds → necessary to extend to $l$-hop paths.
3. Points are spread uniformly at random → otherwise properties will depend on location as well as distance.

[Figure: link prediction accuracy* of Random, Shortest Path, Common Neighbors, Adamic/Adar, and Ensemble of short paths.]

*Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

- The number of paths matters, not the length.
- For large dense graphs, common neighbors are enough.
- Differentiating between different degrees is important.
- In sparse graphs, paths of length 3 or more help in prediction.

[Figure: a generative model is compared with link prediction heuristics on the question "Most likely neighbor of node i?" (node a or node b).]
- A few properties → can justify the empirical observations.
- We also offer some new prediction algorithms.

- Combine bounds from different radii.
- But there might not be enough data to obtain individual bounds from each radius.
- New sweep estimator:
  - $Q_r$ = fraction of nodes with radius ≤ r that are common neighbors.
  - Higher $Q_r$ → smaller $d_{ij}$ w.h.p.

- $Q_r$ = fraction of nodes with radius ≤ r that are common neighbors. Larger $Q_r$ → smaller $d_{ij}$ w.h.p.
- $T_R$ = fraction of nodes with radius ≥ R that are common neighbors. Smaller $T_R$ → larger $d_{ij}$ w.h.p.

[Figure: number of common neighbors of a given radius, plotted against r, with $Q_r$ (nodes of radius ≤ r) and $T_R$ (nodes of radius ≥ R) marked.]
- Large $Q_r$ → small $d_{ij}$
- Small $T_R$ → large $d_{ij}$
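The two sweep quantities are simple pooled fractions, which is what lets them use data from many radii at once. A minimal sketch follows, assuming each candidate node comes with a known radius and a flag saying whether it is a common neighbor of i and j; the function and argument names are illustrative.

```python
def sweep_estimates(radii, is_common_neighbor, r, big_r):
    """Return (Q_r, T_R) for one pair (i, j).

    Q_r: fraction of nodes with radius <= r that are common neighbors.
    T_R: fraction of nodes with radius >= big_r that are common neighbors.
    A fraction is None when no node falls in its radius range.
    """
    below = [flag for rad, flag in zip(radii, is_common_neighbor) if rad <= r]
    above = [flag for rad, flag in zip(radii, is_common_neighbor) if rad >= big_r]
    q_r = sum(below) / len(below) if below else None
    t_r = sum(above) / len(above) if above else None
    return q_r, t_r
```

Sweeping r upward (and R downward) pools more nodes into each fraction, trading resolution in radius for less noisy estimates.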