Published by Elvis Bernard. Modified about 1 year ago.

1
Size-estimation framework with applications to transitive closure and reachability
Presented by Maxim Kalaev
Edith Cohen, AT&T Bell Labs, 1996

2
Agenda
- Intro & Motivation
- Algorithm sketch
- The estimation framework
- Estimating reachability
- Estimating neighborhood sizes

3
Introduction
Descendant counting problem: “Given a directed graph G, compute for each node the number of nodes reachable from it, and the total size of the transitive closure.”

4
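Introduction
$S(v)$: the set of nodes reachable from node $v$.
Transitive closure size: $T = \sum_{v \in V} |S(v)|$
Example: |S(‘A’)| = 5, |S(‘B’)| = 3, T = |S(‘A’)| + |S(‘B’)| + … = 15
[Figure: example directed graph on nodes A, B, C, D, E]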
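The quantities being estimated can be computed exactly by brute force, which is a useful baseline for small inputs. The sketch below assumes S(v) includes v itself (the example's |S(‘A’)| = 5 on five nodes suggests this). The edge list is hypothetical: the slide's actual edges are not recoverable from the text, so a simple path was chosen that happens to reproduce the example's numbers.

```python
def reachable_sets(nodes, edges):
    """Return {v: S(v)}, where S(v) is the set of nodes reachable from v (v included)."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
    result = {}
    for start in nodes:
        seen = {start}
        stack = [start]
        while stack:  # iterative DFS from start
            u = stack.pop()
            for nxt in adj[u]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        result[start] = seen
    return result


# Hypothetical edges (a path A -> C -> B -> D -> E); NOT the slide's graph.
NODES = ["A", "B", "C", "D", "E"]
EDGES = [("A", "C"), ("C", "B"), ("B", "D"), ("D", "E")]

S = reachable_sets(NODES, EDGES)
T = sum(len(s) for s in S.values())  # transitive closure size
print(len(S["A"]), len(S["B"]), T)  # 5 3 15
```

This exact computation runs one DFS per node, i.e. O(|V|·|E|) time in the worst case, which is the cost the estimation framework is meant to avoid.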

5
Motivation
Applicable to:
- DB-query size estimation
- Data mining
- Matrix multiplication optimization
- Parallel DFS algorithm optimization

6
Framework algorithm sketch
Least descendant mapping: given a graph G(V,E) with ranks on its nodes, compute a mapping from each node v in V to the least-ranked node in S(v).
[Figure: example graph; each node is labeled with its rank: A4, B1, C5, D2, E3]
Example (values are ranks): LE(‘A’) = 1, LE(‘C’) = 2

7
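Framework algorithm sketch
The rank of the LE (least element) is highly correlated with the size of S(v): with random ranks, the larger S(v) is, the smaller its minimum rank tends to be.
The precision can be improved by running several iterations, each with a fresh random rank assignment and a recalculation of LE.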

8
The estimation framework
Let X be a set of elements x with non-negative weights w(x).
Let Y be a set of labels y, and let S be a mapping from labels y to subsets of X.
Our objective is to compute, for each y, an estimate of $w(S(y)) = \sum_{x \in S(y)} w(x)$, assuming X, Y and the weights are given but it is costly to calculate w(S(y)) for all y.

9
The estimation framework
Assume we have the following LE (least element) oracle: given ranks R(x) on the elements of X, LE(y) returns the element of S(y) with minimal rank in O(1) time: $R(LE(y)) = \min_{x \in S(y)} R(x)$
The estimation algorithm performs k iterations, where k is determined by the required precision.

10
The estimation framework
Iteration:
- Independently for each x in X, select a random rank R(x) from the exponential distribution with parameter w(x), whose distribution function is $F(t) = 1 - e^{-w(x)\,t}$ for $t \ge 0$.
- Apply LE to the selected ranking and store the obtained minimum rank for each y in Y.

11
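The estimation framework
Proposition: the distribution of the minimum rank R(le(y)) depends only on w(S(y)).
Proof: the minimum of k independent r.v.'s with exponential distributions with parameters $\lambda_1, \dots, \lambda_k$ has an exponential distribution with parameter $\lambda_1 + \dots + \lambda_k$. Hence R(le(y)) is exponentially distributed with parameter w(S(y)).
Our objective now is to estimate the distribution parameter from the given samples.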
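The fact used in the proof follows in one line from independence. As a worked step, with $R_1, \dots, R_k$ denoting independent exponential r.v.'s with parameters $\lambda_1, \dots, \lambda_k$ (notation introduced here for illustration):

```latex
\begin{align*}
\Pr\Big[\min_{1 \le i \le k} R_i > t\Big]
  = \prod_{i=1}^{k} \Pr[R_i > t]
  = \prod_{i=1}^{k} e^{-\lambda_i t}
  = e^{-(\lambda_1 + \dots + \lambda_k)\,t},
  \qquad t \ge 0 .
\end{align*}
```

This is the tail of an exponential distribution with parameter $\lambda_1 + \dots + \lambda_k$. Taking the parameters to be the weights w(x) of the elements x in S(y), the parameter of the minimum rank is exactly w(S(y)).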

12
The estimation framework
The mean of exponentially distributed r.v.'s with parameter λ is 1/λ.
We can use this fact to estimate λ from samples by 1/(sample mean).
Use this to estimate w(S(y)) from the minimal ranks $R_1, \dots, R_k$ obtained in the k iterations: $\hat{w}(S(y)) = k \,/\, \sum_{i=1}^{k} R_i$
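A minimal simulation of the framework, assuming strictly positive weights and a brute-force stand-in for the LE oracle (a plain `min` over each subset, not an O(1) oracle). The function name `estimate_weights` and the toy sets are hypothetical; the estimator is the 1/(sample mean) rule from this slide.

```python
import random


def estimate_weights(weights, subsets, k, seed=0):
    """Estimate w(S(y)) for every label y from k rounds of exponential ranks.

    weights: {x: w(x)} with w(x) > 0 (random.expovariate needs a nonzero rate)
    subsets: {y: non-empty iterable of elements of X}
    """
    rng = random.Random(seed)
    rank_sums = {y: 0.0 for y in subsets}
    for _ in range(k):
        # Independent rank for each x, exponential with parameter w(x).
        rank = {x: rng.expovariate(w) for x, w in weights.items()}
        for y, members in subsets.items():
            # Brute-force stand-in for the LE oracle: minimum rank in S(y).
            rank_sums[y] += min(rank[x] for x in members)
    # 1 / (sample mean) = k / (sum of the k minimum ranks).
    return {y: k / total for y, total in rank_sums.items()}


# Toy instance: 40 unit-weight elements, two labels.
weights = {x: 1.0 for x in range(40)}
subsets = {"small": range(5), "big": range(40)}
estimates = estimate_weights(weights, subsets, k=2000)
print(estimates)  # expected to be close to {"small": 5, "big": 40}
```

With k = 2000 the relative standard error of each estimate is roughly 1/sqrt(k), about 2%, so the printed values should land near the true weights 5 and 40.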

13
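The estimation framework
More estimators:
- Selecting the k(1-1/e)-smallest of the k samples (like the median for the uniform distribution). Its value estimates 1/w(S(y)), because the (1-1/e)-quantile of an exponential distribution with parameter λ is 1/λ.
- Using this non-intuitive average estimator: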

14
The estimation framework
Complexity so far:
- Allowing relative tolerated error ε we need to store significant bits for R's
- The k rank-assignment iterations take O(k|X|) time + k*O(oracle setup time)
Asymptotic accuracy bounds (the proof comes later)

15
Estimating reachability
Objective: given a graph G(V,E), estimate for each v the number of its descendants $|S(v)|$, and the size of the transitive closure $T = \sum_{v \in V} |S(v)|$.
All we need is to implement an oracle for calculating the LE mapping. The following algorithm takes an arbitrary ranking of the nodes, given in sorted order, and computes the mapping in O(|E|) time:

16
Estimating reachability
LE subroutine()
  Reverse the direction of the edges of the graph
  Iterate until V = {}:
    Pop v with minimal rank from V
    Run DFS to find all nodes reachable from v (call this set of nodes U)
    For each node u in U set LE(u) = v
    V = V \ U
    E = E \ {edges incident to nodes in U}
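The subroutine can be sketched in Python as follows. This is one possible rendering, not the presenter's code: instead of physically deleting U and its edges, it skips nodes that already have an LE value, which has the same effect. The function name `least_descendants` and the example path graph are hypothetical; the ranks (B=1, D=2, E=3, A=4, C=5) follow the example labels used in the slides.

```python
def least_descendants(nodes_by_rank, edges):
    """Map every node v to the least-ranked node in S(v) (v itself included).

    nodes_by_rank: nodes listed in increasing rank order
    edges: iterable of directed edges (u, v) meaning u -> v
    """
    # Reversed adjacency: radj[v] lists the nodes that have an edge into v.
    radj = {v: [] for v in nodes_by_rank}
    for u, v in edges:
        radj[v].append(u)

    le = {}
    for source in nodes_by_rank:      # "pop v with minimal rank"
        if source in le:              # already removed in an earlier round
            continue
        le[source] = source
        stack = [source]
        while stack:                  # DFS on the reversed graph
            u = stack.pop()
            for p in radj[u]:
                if p not in le:       # assigned nodes count as deleted
                    le[p] = source
                    stack.append(p)
    return le


# Hypothetical graph: path A -> C -> B -> D -> E.
edges = [("A", "C"), ("C", "B"), ("B", "D"), ("D", "E")]
le = least_descendants(["B", "D", "E", "A", "C"], edges)
print(le)  # {'B': 'B', 'C': 'B', 'A': 'B', 'D': 'D', 'E': 'E'}
```

Every node receives its LE value exactly once and every reversed edge is scanned at most once, so one call costs O(|V| + |E|) after sorting.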

17
Estimating reachability
Each estimation iteration takes O(|V|) + O(|E|) time, assuming we can sort the node ranks in expected linear time.
Accuracy bounds (from the estimator bounds)

18
Estimating neighborhood sizes
Problem: given a graph G(V,E) with nonnegative edge lengths, estimate the number of nodes within distance at most d from a node v, denoted n(v,d).
Our algorithm will preprocess G in time and after that will be able to answer (v,d) queries in time

19
Estimating neighborhood sizes
N(A,7)={A,B,C,D,E}, n(A,7)=5
N(A,3)={A,C,E}, n(A,3)=3
N(D,0)={D}, n(D,0)=1
N(C,∞)={C}, n(C,∞)=1
[Figure: example graph with nodes A4, B1, C5, D2, E3]

20
Estimating neighborhood sizes
After preprocessing G we will have, for each node v, a list of pairs ({d1,s1}, {d2,s2},…,{dη,sη}), where the d's stand for distances and the s's stand for estimated neighborhood sizes. The lists are sorted by the d's.
To obtain n(v,d) we look for the pair i such that $d_i \le d < d_{i+1}$ and return $s_i$.
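Because each list is sorted by distance, a query is a binary search. The sketch below assumes the lookup rule is "take the last pair whose distance does not exceed d"; the helper name `neighborhood_size` and the sample pairs (including the made-up size estimates) are hypothetical.

```python
from bisect import bisect_right


def neighborhood_size(pairs, d):
    """Return s_i for the last pair (d_i, s_i) with d_i <= d.

    pairs: list of (distance, estimated_size) tuples sorted by distance
    """
    distances = [dist for dist, _ in pairs]
    i = bisect_right(distances, d) - 1   # index of the last d_i <= d
    if i < 0:
        raise ValueError("d is smaller than every stored distance")
    return pairs[i][1]


# Hypothetical list for one node v; the sizes are made-up estimates.
pairs_v = [(0, 1.0), (1, 2.1), (2, 2.9), (4, 5.2)]
print(neighborhood_size(pairs_v, 3))             # 2.9
print(neighborhood_size(pairs_v, float("inf")))  # 5.2
```

In a real implementation the `distances` array would be stored once per node rather than rebuilt on every query, so that the search itself costs O(log η) for a list of η pairs.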

21
Estimating neighborhood sizes
The algorithm runs k iterations. In each iteration it creates for each node in G a least-element list ({d1,v1}, {d2,v2},…,{dη,vη}) such that for any neighborhood (v,d) we can find its min-rank node from the list: for $d_i \le d < d_{i+1}$ the min-rank node is $v_i$.

22
Estimating neighborhood sizes
Neighborhoods:
N(A,7)={A,B,C,D,E}
N(A,3)={A,C,E}
N(D,1)={C,D}
N(C,∞)={C}
LE-lists (shown as {node, distance} pairs):
A: ({A,0}{E,1}{D,2}{B,4})
B: ({B,0})
C: ({C,0})
D: ({D,0})
E: ({E,0}{D,3})
[Figure: example graph with nodes A4, B1, C5, D2, E3]

23
Estimating neighborhood sizes - alg
sub Make_le_lists()
  Assume the nodes $v_1, \dots, v_n$ are sorted by rank in increasing order
  Reverse the edge directions of G
  For i=1..n: $d_i \leftarrow \infty$
  For i=1..n (modified Dijkstra’s alg.) DO: (next slide)

24
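Estimating neighborhood sizes - alg
I. Start with an empty heap; place $v_i$ on the heap with label 0.
II. Iterate until the heap is empty:
  Pop the node $v_k$ with minimal label d from the heap.
  Add the pair $(d, v_i)$ to $v_k$’s LE-list; set $d_k \leftarrow d$.
  For each out-edge $(v_k, v_j)$ of $v_k$, with length $w(v_k, v_j)$:
    If $v_j$ is in the heap, update its label to $\min(\mathrm{label}(v_j),\ d + w(v_k, v_j))$.
    Else, if $d + w(v_k, v_j) < d_j$, place $v_j$ on the heap with label $d + w(v_k, v_j)$.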
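The pruned Dijkstra search can be sketched as below. Two things are assumptions rather than the slide's content: Python's `heapq` with lazy deletion is swapped in for a heap with an update operation, and the example edge lengths are hypothetical, chosen only so that the output matches the LE-lists shown in the demo (the real edges of the slide's graph are not recoverable). Lists are returned as (distance, node) pairs sorted by increasing distance.

```python
import heapq
import math


def make_le_lists(nodes_by_rank, edges):
    """Build LE-lists for one ranking.

    nodes_by_rank: nodes in increasing rank order
    edges: {(u, v): length} for directed edges u -> v, lengths >= 0
    """
    radj = {v: [] for v in nodes_by_rank}          # reversed graph
    for (u, v), length in edges.items():
        radj[v].append((u, length))

    best = {v: math.inf for v in nodes_by_rank}    # d_v: nearest lower rank so far
    le_lists = {v: [] for v in nodes_by_rank}

    for source in nodes_by_rank:                   # iteration i starts at v_i
        label = {source: 0}
        done = set()
        heap = [(0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if u in done or d > label[u]:
                continue                           # stale entry (lazy deletion)
            done.add(u)
            le_lists[u].append((d, source))
            best[u] = d
            for p, length in radj[u]:
                nd = d + length
                # Push only if this improves p's label AND beats every
                # lower-ranked node already recorded for p (the pruning step).
                if p not in done and nd < label.get(p, math.inf) and nd < best[p]:
                    label[p] = nd
                    heapq.heappush(heap, (nd, p))

    for v in le_lists:
        le_lists[v].reverse()                      # increasing distance
    return le_lists


# Hypothetical edge lengths, chosen to reproduce the demo's LE-lists.
edges = {("A", "E"): 1, ("A", "D"): 2, ("E", "D"): 3, ("A", "B"): 4, ("D", "C"): 1}
lists = make_le_lists(["B", "D", "E", "A", "C"], edges)
print(lists["A"])  # [(0, 'A'), (1, 'E'), (2, 'D'), (4, 'B')]
print(lists["E"])  # [(0, 'E'), (3, 'D')]
```

The pruning test `nd < best[p]` is what keeps the lists short: a node is revisited in a later iteration only if the current source is strictly closer to it than every lower-ranked node.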

25
Estimating neighborhood sizes - demo
[Figure: step-by-step run of Make_le_lists on the example graph with nodes A4, B1, C5, D2, E3]
Resulting LE-lists:
A: A:0 E:1 D:2 B:4
B: B:0
C: C:0
D: D:0
E: E:0 D:3

26
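Estimating neighborhood sizes - analysis
Correctness
Proposition 1:
- A node v is placed on the heap in iteration i if and only if $\mathrm{dist}(v, v_i) < \mathrm{dist}(v, v_j)$ for all $j < i$.
- If v is placed on the heap in iteration i, then the pair $(\mathrm{dist}(v, v_i), v_i)$ is placed on v’s list and the value d of v is updated to be $\mathrm{dist}(v, v_i)$.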

27
Estimating neighborhood sizes - analysis
Complexity
Proposition 2: if the ranking is a random permutation, the expected size of each LE-list is O(log|V|).
The proof is based on Proposition 1 and a divide-and-conquer style analysis.

28
Estimating neighborhood sizes - analysis
(proof cont.)
Assume the LE-list of node u contains x pairs. Consider the nodes sorted by their distance to node u: v1, v2, ….
According to Proposition 1, node u enters the heap in iteration i iff all the nodes with ranks lower than i are farther from u than the rank-i node is.
With random ranks, the rank-i node that enters u's list is expected to be nearer to u than about half of the nodes with ranks > i, so each new list entry roughly halves the set of nodes that can still enter the list.
It follows that x is O(log|V|) in expectation.
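The same bound can also be obtained by a direct counting argument, given here as an alternative to the halving argument and not taken from the slides. Assume the distances from u to the n nodes it can reach, $v_1, \dots, v_n$ (sorted by distance, $n \le |V|$), are distinct. By Proposition 1, $v_j$ appears on u's list exactly when it has the lowest rank among $v_1, \dots, v_j$, which under a random permutation happens with probability $1/j$:

```latex
\begin{align*}
\mathbb{E}[x]
  = \sum_{j=1}^{n} \Pr\big[\, v_j \text{ has the lowest rank among } v_1, \dots, v_j \,\big]
  = \sum_{j=1}^{n} \frac{1}{j}
  = H_n
  \le 1 + \ln n
  = O(\log |V|).
\end{align*}
```

Ties in distance can only shorten the list, since the condition in Proposition 1 is a strict inequality.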

29
Estimating neighborhood sizes - analysis
Complexity (cont.)
Running time: using Fibonacci heaps, pop() takes O(log|V|) time and insert() or update() takes O(1) time.
Let … be the number of iterations in which … was placed on the heap (0 …

30
Estimating neighborhood sizes
k-iteration issues
What to do with the k LE-lists obtained per node? The naïve way gives O(k*loglog|V|) query time. This can be improved to O(logk + loglog|V|) by merging the lists and storing the sum of ranks per breakpoint.
Total algorithm setup time is:

32
Summary
- General size-estimation framework
- Two applications: transitive closure size estimation and neighborhood size estimation

33
THE END!
