
1 Fast Nearest-neighbor Search in Disk-resident Graphs (Presenter: 鲁轶奇)

2 IBM – China Research Lab Outline  Introduction  Background & related works  Proposed Work  Experiments

3 Introduction-Motivation  Graphs are becoming enormous.  Streaming algorithms must take multiple passes over the entire dataset.  Others perform clever preprocessing, but each is tied to one specific similarity measure.  This paper introduces analysis and algorithms that address the scalability problem in a generalizable way: not specific to one kind of graph partitioning nor to one specific proximity measure.

4 Introduction-Motivation (cont.)  Real-world graphs contain high-degree nodes.  Local algorithms compute a node's value by combining those of its neighbors.  Whenever a high-degree node is encountered, these algorithms have to examine a much larger neighborhood, leading to severely degraded performance.

5 Introduction-Motivation (cont.)  Algorithms can no longer assume that the entire graph fits in memory.  Compression techniques help, but there are at least three settings where they might not work:  social networks are far less compressible than Web graphs  decompression might lead to an unacceptable increase in query response time  even if a graph could be compressed down to a gigabyte, it might be undesirable to keep it in memory on a machine that is running other applications

6 Contribution  a simple transform of the graph (turning high-degree nodes into sinks)  a deterministic local algorithm guaranteed to return the nearest neighbors under personalized PageRank from the disk-resident clustered graph  a fully external-memory clustering algorithm (RWDISK) that uses only sequential sweeps over data files

7 Background-Personalized PageRank  A random walk starting at node a; at any step the walk is reset to the start node with probability α.  PPV(a, j): the PPV entry from a to j.  A large value indicates high similarity.
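The α-discounted walk above lends itself to a simple Monte Carlo estimator: the node at which the walk is standing when the reset fires is distributed exactly as PPV(a, ·). A minimal sketch (not the paper's algorithm; the adjacency-list representation and dangling-node handling are assumptions):

```python
import random
from collections import defaultdict

def ppv_monte_carlo(adj, a, alpha=0.15, n_walks=10000, rng=random):
    """Estimate PPV(a, .) by simulating alpha-discounted random walks:
    at each step the walk resets with probability alpha, and we count
    the node it was at when the reset fired."""
    visits = defaultdict(int)
    for _ in range(n_walks):
        node = a
        while rng.random() > alpha:       # reset fires with prob. alpha
            neighbors = adj.get(node)
            if not neighbors:             # dangling node: jump back to a
                node = a
                continue
            node = rng.choice(neighbors)
        visits[node] += 1
    return {j: c / n_walks for j, c in visits.items()}
```

On a two-node cycle the estimate concentrates most mass on the start node, matching the intuition that a large PPV value indicates high similarity.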

8 Background-Clustering  Random-walk-based approaches compute good-quality local graph partitions near a given anchor node.  Main intuition: a random walk started inside a low-conductance cluster will mostly stay inside the cluster.  Conductance: Φ_V(A) = (number of edges crossing the boundary of A) / min(μ(A), μ(V∖A)), where μ(A) = Σ_{i∈A} degree(i).
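For concreteness, conductance can be computed directly from this definition; a sketch for an undirected graph stored as adjacency sets (the representation is an assumption):

```python
def conductance(adj, A):
    """Conductance of node set A in an undirected graph given as
    {node: set(neighbors)}: edges leaving A, divided by the smaller
    of mu(A) and mu(V \\ A), where mu(S) sums the degrees in S."""
    A = set(A)
    cut = sum(1 for u in A for v in adj[u] if v not in A)
    mu_A = sum(len(adj[u]) for u in A)
    mu_rest = sum(len(adj[u]) for u in adj) - mu_A
    return cut / min(mu_A, mu_rest)
```

Two triangles joined by a single edge give conductance 1/7 for either triangle: a low value, so a random walk started inside one triangle rarely escapes it.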

9 Proposed Work  First problem: most local algorithms for computing nearest neighbors suffer from the presence of high-degree nodes.  Second issue: computing proximity measures on large disk-resident graphs.  Third issue: finding a good clustering.

10 Effect of high degree nodes  High-degree nodes are a performance bottleneck.  Effect on personalized PageRank:  Main intuition: a very high-degree node passes on only a small fraction of its value to each out-neighbor, which might not be significant enough to invest our computing resources on.  Argument: stopping a random walk at a high-degree node does not change the personalized PageRank value at other nodes of relatively smaller degree.
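The sink transform described above is tiny in code; a sketch, where the degree threshold and adjacency-list representation are assumptions:

```python
def make_sinks(adj, max_degree):
    """Turn every node whose out-degree exceeds max_degree into a sink:
    it keeps no outgoing edges, so random walks that reach it simply
    stop there instead of fanning out over a huge neighborhood."""
    return {u: ([] if len(nbrs) > max_degree else list(nbrs))
            for u, nbrs in adj.items()}
```

Low-degree nodes and their edges are untouched, which is why the PPV values at such nodes change only by a small, degree-dependent error.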

11 Effect of high degree nodes  The error incurred in personalized PageRank is inversely proportional to the degree of the sink node.

12 Effect of high degree nodes  f_α(i, j) is simply the probability of hitting node j for the first time from node i in this α-discounted walk.

13 Effect of high degree nodes

14 Effect of high degree nodes  The error incurred by introducing a set of sink nodes:

15 Nearest-neighbors on clustered graphs  How to use the clusters for deterministic computation of nodes "close" to an arbitrary query.  Use degree-normalized personalized PageRank.  For a given node i, the PPV from j to i, i.e. PPV(j, i), can be written as:

16 Assume that j and i are in the same cluster S.  We don't have access to the PPV values at nodes k outside S, so replace them with upper and lower bounds.  Lower bound: 0, i.e. we pretend that S is completely disconnected from the rest of the graph.  Upper bound: a random walk from outside S has to cross the boundary of S to hit node i.

17  If S is small in size, the power method suffices.  At each iteration, maintain the upper and lower bounds for nodes within S.  To expand S: bring in the clusters for x of the external neighbors of S.  Stop when this global upper bound falls below a pre-specified small threshold γ.  In reality, use an additive slack ε: (ub_{k+1} - ε).

18 Ranking Step  Return all nodes whose lower bound is greater than the (k+1)-th largest upper bound.  Why: all nodes outside the cluster are guaranteed to have personalized PageRank smaller than the global upper bound, which is smaller than γ.
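The ranking rule can be sketched directly from the two bound dictionaries; `lb` and `ub` (node-to-bound maps) are hypothetical names for the bounds maintained in the previous step:

```python
import heapq

def certain_top_k(lb, ub, k):
    """Return the nodes guaranteed to be among the true top k: those
    whose lower bound exceeds the (k+1)-th largest upper bound, since
    at most k nodes can have a true score above that threshold."""
    threshold = heapq.nlargest(k + 1, ub.values())[-1]
    return sorted(v for v, b in lb.items() if b > threshold)
```

Any node returned here keeps its place no matter how the remaining uncertainty resolves, which is what makes the local computation deterministic.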

19 Clustered Representation on Disk  Intuition: use a set of anchor nodes and assign each remaining node to its "closest" anchor, using personalized PageRank as the measure of "closeness".  Algorithm:  Start with a random set of anchors.  Iteratively add new anchors from the set of unreachable nodes, then recompute the cluster assignments.  Two properties:  new anchors are far away from the existing anchors  when the algorithm terminates, each node i is guaranteed to be assigned to its closest anchor.
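The anchor loop above can be sketched as follows; `ppv_from(anchor)` is a hypothetical callback returning {node: PPV(anchor, node)}, and the stopping rule (promote unreachable nodes until every node is covered) follows the slide:

```python
import random

def cluster_by_anchors(adj, ppv_from, n_init=2, rng=random):
    """Sketch of anchor-based clustering: seed with random anchors,
    assign each node to the anchor with the largest PPV value, and
    keep promoting unreachable nodes to anchors until all are covered."""
    nodes = list(adj)
    anchors = rng.sample(nodes, n_init)
    while True:
        ppv = {a: ppv_from(a) for a in anchors}
        assign = {}
        for v in nodes:
            best = max(anchors, key=lambda a: ppv[a].get(v, 0.0))
            if ppv[best].get(v, 0.0) > 0.0:
                assign[v] = best       # reachable: take the closest anchor
        unreachable = [v for v in nodes if v not in assign]
        if not unreachable:
            return assign
        anchors.append(rng.choice(unreachable))   # new anchor is far away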

20 IBM – China Research Lab RWDISK  4 kinds of files  Edge file: Each line represents an edge by a triplet {src,dst,p}, p = P(X t = dst| X t-1 =src)  Last file: each line in Last is {src,anchor,value}, value= P(X t-1 =src| X 0 =anchor)  Newt file: Newt contains xt, each line is {src,anchor,value}, where value equals P(X t =src|X 0 =anchor)  Ans file: represents the values for vt. Thus each line in Ans is {src,anchor,value}, where value =  Algorithm to compute vt by power iterations

21 IBM – China Research Lab RWDISK(cont.)  Newt is simply a matrix-vector product between the transition matrix stored in Edges and Last.  File are stored lexicographically, this can be obtained by a file-join like algorithm.  First step: simply joins the two files, and accumulates the probability values at a node from its in-neighbors.  Next step: the Newt file is sorted and compressed, in order to add up the values from different in-neighbors  multiply the probabilities by α(1-α) t-1  Fix the number of iterations at maxiter.

22 IBM – China Research Lab  One major problem is that intermediate files can become much larger than the number of edges  in most real-world networks within 4-5 steps it is possible to reach a huge fraction of the whole graph  Intermediate file getting too large  Using rounding for reducing file sizes

23 IBM – China Research Lab Experiments  Dataset

24 IBM – China Research Lab Experiments(cont.)  System Detail  On a off-the-shelf PC  Least recently used replacement scheme  Page size 4KB

25 IBM – China Research Lab Experiments(cont.)-Effect of high degree nodes  Three-fold advantages: - Speed up external memory clustering - Reduce number of page-faults in random-walk simulation  Effect on RWDISK

26 IBM – China Research Lab Experiments(cont.)-Deterministic vs. Simulations  Computing top-10 neighbors with approximation slack for 500 randomly picked nodes  Citeseer original graph  DBLP turned nodes with degree above 1000 into sinks  LiveJournal turn nodes with degree above 100 into sinks

27 IBM – China Research Lab Experiments(cont.)-RWDISK vs. METIS  maxiter = 30, α = 0.1 and ε = for PPV  METIS for baseline algorithm  break DBLP into parts, which used 20GB of RAM  Break LiveJournal into parts, which used 50GB of RAM  In comparison, RWDISK can be excuted on a 2-4 GB standard PC

28 IBM – China Research Lab Experiments(cont.)-RWDISK vs. METIS  Measure of cluster quality  A good disk-based clustering must satisfy : - Low conductance - Fit in disk-sized pages

29 IBM – China Research Lab Experiments(cont.)-RWDISK vs. METIS

30 IBM – China Research Lab Experiments(cont.)-RWDISK vs. METIS


Download ppt "Fast Nearest-neighbor Search in Disk-resident Graphs 报告人:鲁轶奇."

Similar presentations


Ads by Google