# Size-estimation framework with applications to transitive closure and reachability

Presented by Maxim Kalaev. Edith Cohen, AT&T Bell Labs, 1996.



Agenda
- Intro & motivation
- Algorithm sketch
- The estimation framework
- Estimating reachability
- Estimating neighborhood sizes

Introduction
- Descendant counting problem: "Given a directed graph G, compute for each node the number of nodes reachable from it, and the total size of the transitive closure."

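Introduction
- $S(v)$: the set of nodes reachable from node $v$
- Transitive closure size: $T = \sum_{v \in V} |S(v)|$
- Example: |S('A')| = 5, |S('B')| = 3, T = |S('A')| + |S('B')| + … = 15

(Figure: example directed graph on the nodes A, B, C, D, E.)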
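For reference, the exact quantities can be computed with one graph search per node, which costs O(|V|·|E|) in the worst case and is the cost the estimation framework tries to avoid. A minimal Python sketch: the five-node graph below is a made-up example (the slide's figure is not recoverable), and S(v) is assumed to include v itself, as the slide's counts suggest.

```python
from collections import deque

def reachable_set(graph, source):
    """Return S(source): all nodes reachable from source, including itself."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def descendant_counts(graph):
    """Exact |S(v)| for every node, using one BFS per node."""
    return {v: len(reachable_set(graph, v)) for v in graph}

# Hypothetical example graph (adjacency lists), not the one from the slide.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": [], "E": ["A"]}
counts = descendant_counts(graph)
closure_size = sum(counts.values())  # T = sum of |S(v)| over all nodes
```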

Motivation
- Estimating query result sizes in databases
- Data mining
- Optimizing matrix multiplication
- Optimizing parallel DFS algorithms

Framework algorithm sketch
- Least-descendant mapping: given a graph G(V,E) with ranks on its nodes, compute a mapping from each node v in V to the least-ranked node in S(v).
- Example (node ranks A: 4, B: 1, C: 5, D: 2, E: 3): LE('A') = 1, LE('C') = 2

(Figure: the example graph with each node labeled by its rank.)

Framework algorithm sketch
- The rank of the least element LE(v) is highly correlated with the size of S(v): the larger S(v) is, the smaller its minimum rank tends to be.
- Precision can be improved by running several iterations, each with a fresh random rank assignment and a recomputed LE mapping.

The estimation framework
- Let X be a set of elements x with non-negative weights w(x).
- Let Y be a set of labels y, with a mapping $S: Y \to 2^X$ from labels to subsets of X.
- Our objective is to compute, for each y, an estimate of $w(S(y)) = \sum_{x \in S(y)} w(x)$, assuming X, Y and the weights are given but computing w(S(y)) exactly for all y is costly.

The estimation framework
- Assume we have the following LE (least element) oracle: given ranks R(x) on the elements of X, LE(y) returns the element of minimum rank in S(y) in O(1) time: $LE(y) = \arg\min_{x \in S(y)} R(x)$
- The estimation algorithm performs k iterations, where k is determined by the required precision.

The estimation framework
- Iteration:
  - Independently for each x in X, draw a random rank R(x) from the exponential distribution with parameter w(x), whose distribution function is $F(t) = 1 - e^{-w(x)\,t}$ for $t \ge 0$.
  - Apply the LE oracle to this ranking and store the minimum rank obtained for each y in Y.

The estimation framework
- Proposition: the distribution of the minimum rank R(LE(y)) depends only on w(S(y)).
- Proof: the minimum of k independent r.v.'s with exponential distributions with parameters $\lambda_1, \dots, \lambda_k$ is exponentially distributed with parameter $\sum_{i=1}^{k} \lambda_i$.
- Our objective now is to estimate the distribution parameter from the given samples.
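The proof step can be written out in one line, assuming the ranks are drawn independently. Here $R_1, \dots, R_k$ and $\lambda_1, \dots, \lambda_k$ denote the ranks and weights of the k elements of S(y):

```latex
\Pr\Bigl[\min_{1 \le i \le k} R_i > t\Bigr]
  = \prod_{i=1}^{k} \Pr[R_i > t]
  = \prod_{i=1}^{k} e^{-\lambda_i t}
  = e^{-\left(\sum_{i=1}^{k} \lambda_i\right) t},
  \qquad t \ge 0 .
```

This is the tail of an exponential distribution with parameter $\sum_i \lambda_i = w(S(y))$, so the minimum rank carries no information about S(y) other than its total weight.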

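The estimation framework
- The mean of exponentially distributed r.v.'s with parameter λ is 1/λ.
- We can use this fact to estimate λ from samples as 1/(sample mean).
- Use this to estimate w(S(y)) from the minimum ranks obtained in k iterations: $\hat{w}(S(y)) = k \big/ \sum_{i=1}^{k} R_i(LE_i(y))$, where $R_i$ and $LE_i$ are the ranking and the LE mapping of iteration i.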
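The whole loop (draw ranks, take minimum ranks, invert the sample mean) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the LE oracle is replaced by a brute-force minimum over each subset, the weights are assumed strictly positive, and the elements, labels and the function name `estimate_weights` are hypothetical.

```python
import random

def estimate_weights(weights, subsets, k, seed=0):
    """Estimate w(S(y)) for every label y from k rounds of minimum ranks.

    weights: dict element -> positive weight w(x)
    subsets: dict label -> set of elements S(y)
    """
    rng = random.Random(seed)
    rank_sums = {y: 0.0 for y in subsets}
    for _ in range(k):
        # Draw R(x) ~ Exp(w(x)) independently for every element.
        rank = {x: rng.expovariate(w) for x, w in weights.items()}
        for y, members in subsets.items():
            # Brute-force stand-in for the LE oracle: minimum rank in S(y).
            rank_sums[y] += min(rank[x] for x in members)
    # 1 / (sample mean) = k / (sum of the k minimum ranks).
    return {y: k / total for y, total in rank_sums.items()}

# Hypothetical instance: true weights are 3, 9 and 10.
weights = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
subsets = {"y1": {"a", "b"}, "y2": {"b", "c", "d"}, "y3": {"a", "b", "c", "d"}}
estimates = estimate_weights(weights, subsets, k=4000)
```

With k samples the relative error of this estimator is on the order of 1/√k, so k = 4000 should land within a few percent of the true weights.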

The estimation framework
- More estimators:
  - Select the k(1 − 1/e)-smallest of the k samples (analogous to taking the median for a uniform distribution).
  - Use this non-intuitive average estimator: […]
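The order-statistic estimator works because the (1 − 1/e)-quantile of an exponential distribution with parameter λ is exactly 1/λ: solving 1 − e^{−λt} = 1 − 1/e gives t = 1/λ. A hedged Python sketch, in which the rounding of k(1 − 1/e) to an index and the name `quantile_estimate` are assumptions made for illustration:

```python
import math
import random

def quantile_estimate(samples):
    """Estimate lambda as 1 / (the k(1 - 1/e)-smallest of the k samples)."""
    k = len(samples)
    # 1-based position of the order statistic; rounding up is an assumption.
    index = max(1, math.ceil(k * (1 - 1 / math.e)))
    return 1.0 / sorted(samples)[index - 1]

rng = random.Random(1)
true_lambda = 5.0  # hypothetical value of w(S(y))
samples = [rng.expovariate(true_lambda) for _ in range(5000)]
estimate = quantile_estimate(samples)
```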

The estimation framework
- Complexity so far:
  - Allowing a relative error ε, we need to store only […] significant bits of each rank R(x).
  - The k rank-assignment iterations take O(k|X|) time plus k times the oracle setup time.
- Asymptotic accuracy bounds (proved later): […]

Estimating reachability
- Objective: given a graph G(V,E), estimate for each node v its number of descendants $|S(v)|$, and the size of the transitive closure $T = \sum_{v \in V} |S(v)|$.
- All we need is to implement an oracle that computes the LE mapping. The following algorithm takes an arbitrary ranking of the nodes, given in sorted order, and computes the mapping in O(|E|) time:

Estimating reachability
LE subroutine:
1. Reverse the direction of every edge of the graph.
2. Iterate until V = {}:
   - Pop the node v with minimum rank from V.
   - Run DFS from v to find all nodes reachable from v (call this set U).
   - For each node u in U, set LE(u) = v.
   - V = V \ U
   - E = E \ {edges incident to nodes in U}
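The subroutine can be sketched in Python as follows. Instead of physically deleting U and its incident edges, this version marks nodes as claimed and never re-enters them, which has the same effect; it also sorts the nodes by rank itself rather than assuming sorted input. The function name and the example edges are hypothetical; only the ranks are taken from the slides.

```python
def least_descendants(graph, rank):
    """Map every node v to the minimum-rank node in S(v) (v included).

    graph: dict node -> list of successors (every node must be a key)
    rank:  dict node -> comparable rank
    """
    # Reverse the edges: u is in S(v) iff v is reachable from u
    # in the reversed graph.
    reverse = {v: [] for v in graph}
    for u, successors in graph.items():
        for v in successors:
            reverse[v].append(u)

    least = {}
    for root in sorted(graph, key=rank.__getitem__):
        if root in least:
            continue  # already claimed by a lower-ranked node
        least[root] = root
        stack = [root]
        while stack:  # iterative DFS restricted to unclaimed nodes
            u = stack.pop()
            for p in reverse[u]:
                if p not in least:
                    least[p] = root
                    stack.append(p)
    return least

# Ranks as on the slides; the edge set is a made-up example.
rank = {"A": 4, "B": 1, "C": 5, "D": 2, "E": 3}
graph = {"A": ["B", "E"], "B": [], "C": ["D"], "D": [], "E": ["D"]}
least = least_descendants(graph, rank)
```

Every node and every reversed edge is touched a constant number of times after sorting, which matches the O(|E|) bound claimed for pre-sorted ranks.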

Estimating reachability
- Each estimation iteration takes O(|V|) + O(|E|) time, assuming the node ranks can be sorted in expected linear time.
- Accuracy bounds (from the estimator bounds): […]

Estimating neighborhood sizes
- Problem: given a graph G(V,E) with non-negative edge lengths, estimate the number of nodes within distance at most d of node v, denoted n(v,d).
- Our algorithm preprocesses G in time […] and can then answer (v,d) queries in time […].

Estimating neighborhood sizes
- N(A,7) = {A,B,C,D,E}, n(A,7) = 5
- N(A,3) = {A,C,E}, n(A,3) = 3
- N(D,0) = {D}, n(D,0) = 1
- N(C,∞) = {C}, n(C,∞) = 1

(Figure: example graph with node ranks A: 4, B: 1, C: 5, D: 2, E: 3 and edge lengths 1, 2, 4, 3, 1, 1.)

Estimating neighborhood sizes
- After preprocessing G, we have for each node v a list of pairs ({d1,s1}, {d2,s2}, …, {dη,sη}), where the d's stand for distances and the s's for estimated neighborhood sizes. The lists are sorted by the d's.
- To obtain n(v,d), we look for the pair i such that $d_i \le d < d_{i+1}$ and return $s_i$.
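Because each list is sorted by distance, the lookup is a binary search. A small Python sketch using the standard `bisect` module; storing distances and sizes as two parallel lists, returning 0 when d is below the first breakpoint, and the sample values are all assumptions made for illustration.

```python
from bisect import bisect_right

def neighborhood_size(distances, sizes, d):
    """Return s_i for the last breakpoint with d_i <= d.

    distances: sorted list of breakpoints d_1 <= d_2 <= ...
    sizes:     estimated neighborhood sizes s_i, parallel to distances
    """
    i = bisect_right(distances, d)
    if i == 0:
        return 0.0  # assumption: no breakpoint at or below d
    return sizes[i - 1]

# Hypothetical list for one node v.
distances = [0, 1, 2, 4]
sizes = [1.0, 2.1, 2.9, 5.2]
```

For example, `neighborhood_size(distances, sizes, 3)` falls between the breakpoints 2 and 4 and returns the size stored at breakpoint 2.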

Estimating neighborhood sizes
- The algorithm runs k iterations. In each iteration it creates, for each node in G, a least-element list ({d1,v1}, {d2,v2}, …, {dη,vη}) such that the min-rank node of any neighborhood (v,d) can be found from the list: for $d_i \le d < d_{i+1}$, the min-rank node is $v_i$.

Estimating neighborhood sizes
Neighborhoods:
- N(A,7) = {A,B,C,D,E}
- N(A,3) = {A,C,E}
- N(D,1) = {C,D}
- N(C,∞) = {C}

LE-lists:
- A: ({A,0} {E,1} {D,2} {B,4})
- B: ({B,0})
- C: ({C,0})
- D: ({D,0})
- E: ({E,0} {D,3})

(Figure: example graph with node ranks A: 4, B: 1, C: 5, D: 2, E: 3 and edge lengths 1, 2, 4, 3, 1, 1.)

Estimating neighborhood sizes - alg
sub Make_le_lists():
- Assume the nodes v1, …, vn are sorted by rank in increasing order.
- Reverse the direction of every edge of G.
- For i = 1..n: set $d_i \leftarrow \infty$.
- For i = 1..n, run a modified Dijkstra's algorithm from $v_i$ (next slide).

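Estimating neighborhood sizes - alg
I. Start with an empty heap and place $v_i$ on the heap with label 0.
II. Iterate until the heap is empty:
- Pop the node $v_k$ with minimum label d from the heap.
- Add the pair $(d, v_i)$ to $v_k$'s LE-list and set $d_k \leftarrow d$.
- For each out-edge $(v_k, v_j)$ of $v_k$ with length $\ell$:
  - If $v_j$ is in the heap, update its label to the minimum of its current label and $d + \ell$.
  - Else, if $d + \ell < d_j$, place $v_j$ on the heap with label $d + \ell$.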
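A Python sketch of the pruned Dijkstra runs described above. Python's `heapq` has no decrease-key operation, so this version pushes duplicate entries and skips stale ones instead of updating labels in place; that is a substitution, not the heap variant used in the analysis. The edge set is hypothetical, chosen so that the output reproduces the LE-lists shown on the earlier slide; the ranks are the ones from the slides.

```python
import heapq

def make_le_lists(graph, rank):
    """Build least-element lists for a digraph with non-negative lengths.

    graph: dict node -> list of (successor, length); every node is a key
    rank:  dict node -> comparable rank
    Returns dict node -> list of (distance, node), sorted by distance.
    """
    reverse = {v: [] for v in graph}
    for u, edges in graph.items():
        for v, length in edges:
            reverse[v].append((u, length))

    inf = float("inf")
    best = {v: inf for v in graph}  # d_v from the slides
    le_lists = {v: [] for v in graph}
    for source in sorted(graph, key=rank.__getitem__):
        heap = [(0, source)]
        label = {source: 0}  # tentative labels of this iteration
        done = set()
        while heap:
            d, u = heapq.heappop(heap)
            if u in done or d > label[u]:
                continue  # stale entry left behind by a later, better push
            done.add(u)
            le_lists[u].append((d, source))
            best[u] = d
            for p, length in reverse[u]:
                nd = d + length
                # Prune: only continue through nodes that get strictly closer.
                if p not in done and nd < best[p] and nd < label.get(p, inf):
                    label[p] = nd
                    heapq.heappush(heap, (nd, p))
    for pairs in le_lists.values():
        pairs.reverse()  # entries were appended with decreasing distance
    return le_lists

rank = {"A": 4, "B": 1, "C": 5, "D": 2, "E": 3}
# Hypothetical edges (node -> [(successor, length)]).
graph = {"A": [("E", 1), ("D", 2), ("B", 4)], "B": [], "C": [],
         "D": [("C", 1)], "E": [("D", 3)]}
le_lists = make_le_lists(graph, rank)
```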

Estimating neighborhood sizes - demo

(Figure: step-by-step run of Make_le_lists on the example graph with node ranks A: 4, B: 1, C: 5, D: 2, E: 3. Resulting LE-lists: A: A:0, E:1, D:2, B:4; B: B:0; C: C:0; D: D:0; E: E:0, D:3.)

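Estimating neighborhood sizes - analysis
Correctness, Proposition 1 (distances are measured in the edge-reversed graph):
- A node v is placed on the heap in iteration i if and only if $dist(v_i, v) < dist(v_j, v)$ for all $j < i$.
- If v is placed on the heap in iteration i, then the pair $(dist(v_i, v), v_i)$ is placed on v's list and the value d stored for v is updated to $dist(v_i, v)$.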

Estimating neighborhood sizes - analysis
Complexity, Proposition 2: if the ranking is a random permutation, the expected size of each LE-list is O(log|V|). The proof is based on Proposition 1 and a divide-and-conquer style analysis.

Estimating neighborhood sizes - analysis (proof cont.)
Assume the LE-list of node u contains x pairs. Consider the nodes sorted by their distance to u: v1, v2, …. According to Proposition 1, u is placed on the heap in iteration i iff every node of lower rank is farther from u than the node of rank i is. A random ranking is expected to partition the sequence v1, v2, … so that the node of rank i is nearer to u than about half of the nodes with rank > i. Each new list entry therefore roughly halves the set of remaining candidates, and it follows that x is O(log|V|) in expectation.

Estimating neighborhood sizes - analysis
Complexity (cont.), running time: using Fibonacci heaps, pop() takes O(log|V|) time and insert() or update() takes O(1) time. The analysis counts, for each node, the number of iterations in which it was placed on the heap […]

Estimating neighborhood sizes
Issues with k iterations:
- What do we do with the k LE-lists obtained per node?
- The naïve way gives O(k·loglog|V|) time per query.
- It can be improved to O(log k + loglog|V|) by merging the lists and storing the sum of ranks at each breakpoint.
- Total algorithm setup time is: […]