Aggregation Algorithms and Instance Optimality


1 Aggregation Algorithms and Instance Optimality
Moni Naor, Weizmann Institute. Joint work with Ron Fagin and Amnon Lotem.

2 Aggregating information from several lists/sources
Outline: define the problem; ways to evaluate algorithms; new algorithms; further research.

3 The problem
Database D of N objects.
An object R has m fields (x1, x2, …, xm), each xi ∈ [0,1].
The objects are given in m lists L1, L2, …, Lm; in list Li all objects are sorted by xi value.
An aggregation function t(x1, x2, …, xm): a monotone increasing function.
Wanted: the top k objects according to t.
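
As a concrete rendering of this setup, here is a minimal Python sketch; the names (Grades, Database, sorted_lists) are ours, not from the talk.

```python
# Minimal sketch of the setting; all names here are ours.
from typing import Callable, Dict, List, Tuple

Grades = Tuple[float, ...]        # (x1, ..., xm), each xi in [0, 1]
Database = Dict[str, Grades]      # object id -> its m field values

def sorted_lists(db: Database, m: int) -> List[List[str]]:
    """The m lists L1..Lm: object ids sorted by field i, best first."""
    return [sorted(db, key=lambda r: db[r][i], reverse=True)
            for i in range(m)]

# Any monotone aggregation function t will do, e.g. min:
t: Callable[[Grades], float] = min
```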

4 Goal
Touch as few objects as possible. How do we access an object?
[Figure: two sorted lists, e.g. L1 with r1 = 0.5, a1 = 0.4 and L2 with r2 = 0.75, b2 = 0.3, c2 = 0.2]

5 Where?
The problem arises when combining information from several sources/criteria.
We concentrate on middleware complexity, without changing the subsystems.

6 Example: Combining Fuzzy Information
The lists are the results of a query: "find an object with color 'red' and shape 'round'".
There are subsystems for color and for shape; each returns a score in [0,1] for each object.
The aggregation function t is how the middleware system should combine the two criteria.
Example: t(R = (x1, x2)) could be min(x1, x2).

7 Example: scheduling pages
Each object is a page in a data broadcast system.
1st field: # of users requesting the page.
2nd field: longest time a user has been waiting.
Combining function t: the product of the two fields (equivalently, their geometric mean).
Goal: find the page with the largest product.

8 Example: Information Retrieval
[Figure: a term-document matrix with documents D1, D2, …, Dk, terms T1, T2, …, Tn, and entries such as W12]
Query T1, T2, T3: find the documents with the largest sum of entries.
The aggregation function t is Σ xi.

9 Modes of Access to the Lists
Sequential/sorted access: obtain the next object in list Li, at cost cS.
Random access: for object R and i ≤ m, obtain xi, at cost cR.
Cost of an execution: cS · (# of sequential accesses) + cR · (# of random accesses).
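
A hypothetical middleware interface charging these costs, continuing the Python sketch above (class and method names are ours):

```python
# Hypothetical access interface with cost accounting (names are ours).
class Middleware:
    def __init__(self, db, m, cS=1.0, cR=1.0):
        self.db, self.m = db, m
        self.cS, self.cR = cS, cR
        self.lists = sorted_lists(db, m)   # from the earlier sketch
        self.depth = [0] * m               # position reached in each list
        self.cost = 0.0

    def sorted_access(self, i):
        """Next (object, grade) pair from list Li; costs cS."""
        r = self.lists[i][self.depth[i]]
        self.depth[i] += 1
        self.cost += self.cS
        return r, self.db[r][i]

    def random_access(self, r, i):
        """Grade xi of object r; costs cR."""
        self.cost += self.cR
        return self.db[r][i]
```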

10 Interesting Cases
Either cR/cS is small (cS ≈ cR), or cR >> cS.
The number of lists m is small.

11 Fagin’s Algorithm - FA
For all lists L1, L2, …, Lm, get the next object in sorted order.
Stop when there is a set of k objects that have appeared in all lists.
For every object R encountered, retrieve all fields x1, x2, …, xm and compute t(x1, x2, …, xm).
Return the top k objects.
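
A sketch of FA over the hypothetical Middleware interface above; this is our rendering of the slide's description, not the paper's code.

```python
# Sketch of FA on the hypothetical Middleware interface above.
def fagin_algorithm(mw, t, k):
    seen = [set() for _ in range(mw.m)]          # ids seen in each list
    while True:
        for i in range(mw.m):                    # one round of sorted access
            r, _ = mw.sorted_access(i)
            seen[i].add(r)
        if len(set.intersection(*seen)) >= k:    # k objects seen in all lists
            break
    # Random access retrieves every field of every object encountered.
    scores = {r: t(tuple(mw.random_access(r, i) for i in range(mw.m)))
              for r in set.union(*seen)}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```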

12 Correctness of FA
For any monotone t and any database D of objects, FA finds the top k objects.
Proof: if object y was never seen, then for each field i, yi ≤ zi for every object z in the intersection, so t(y) ≤ t(z) by monotonicity; hence the true top k is among the objects encountered.

13 Performance of FA
Performance: assuming that the fields are independent, the cost is O(N^{(m-1)/m}) (for constant k).
Better performance with positive correlation between the fields; worse with negative correlation.
A bad aggregation function for FA: max.

14 Goals of this work
Improve complexity and analysis: the worst case is not meaningful here; instead consider instance optimality.
Expand the range of functions: we want to handle all monotone aggregation functions.
Simplify implementation.

15 Instance Optimality
A = a class of algorithms, D = a class of legal inputs. For A ∈ A and D ∈ D measure cost(A, D) ≥ 0.
An algorithm A ∈ A is instance optimal over A and D if there are constants c1 and c2 such that for every A' ∈ A and D ∈ D:
cost(A, D) ≤ c1 · cost(A', D) + c2.
c1 is called the optimality ratio.
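
In display form, the defining inequality of instance optimality reads:

```latex
\forall A' \in \mathbf{A},\; D \in \mathbf{D}:\qquad
\mathrm{cost}(A, D) \;\le\; c_1 \cdot \mathrm{cost}(A', D) + c_2 .
```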

16 …Instance Optimality
Common in competitive analysis of online algorithms: compare an online decision-making algorithm to the best offline one.
Approximation algorithms: compare the solution the best algorithm can find to the one the approximation algorithm finds.
In our case, offline ≈ nondeterminism.

17 …Instance Optimality
We show algorithms that are instance optimal for a variety of:
classes of algorithms (deterministic, probabilistic, approximate);
classes of databases;
access cost functions.

18 Guidelines for Design of Algorithms
Format: do sequential/sorted access (with random access on the other fields) until you know that you have seen the top k.
In general: greedy gathering of information; if a query might allow you to know the top k objects, do it.
This works in all the scenarios considered.

19 The Threshold Algorithm - TA
For all lists L1, L2, …, Lm, get the next object in sorted order.
For each object R returned: retrieve all fields x1, x2, …, xm and compute t(x1, x2, …, xm); if it is one of the top k answers so far, remember it.
For 1 ≤ i ≤ m, let x̄i be the bottom value seen in Li (so far). Define the threshold value τ to be t(x̄1, x̄2, …, x̄m).
Stop when k objects have been found with t value ≥ τ. Return the top k objects.
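
A sketch of TA over the same hypothetical interface (our code; the heap bookkeeping is an implementation choice, not from the slides):

```python
# Sketch of TA on the hypothetical Middleware interface above.
import heapq

def threshold_algorithm(mw, t, k):
    top = []                                    # min-heap of (score, id)
    in_top = set()
    bottom = [1.0] * mw.m                       # bottom value seen in each Li
    while True:
        for i in range(mw.m):
            r, x = mw.sorted_access(i)
            bottom[i] = x
            if r in in_top:
                continue
            # Random access fills in all fields of R, then score it.
            score = t(tuple(mw.random_access(r, j) for j in range(mw.m)))
            if len(top) < k:
                heapq.heappush(top, (score, r))
                in_top.add(r)
            elif score > top[0][0]:
                _, evicted = heapq.heapreplace(top, (score, r))
                in_top.discard(evicted)
                in_top.add(r)
        tau = t(tuple(bottom))                  # threshold value
        if len(top) == k and top[0][0] >= tau:  # k objects with t value >= tau
            return sorted(top, reverse=True)
```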

20 Maintained Information
Example: m = 2, k = 1, t is min.
[Figure: two sorted lists.
L1: c1 = 0.9, b1 = 0.7, r1 = 0.4, a1 = 0.1.
L2: s2 = 3/4, w2 = 2/3, z2 = 1/2, q2 = 1/4.
Full objects: c = (0.9, 1/12), s = (0.05, 3/4), b = (0.7, 1/11), w = (0.07, 2/3), r = (0.4, 1/8), z = (0.09, 1/2), q = (0.08, 1/4), a = (0.1, 1/13).
Maintained: the top object so far (here r, with t(r) = 1/8, vs. t(b) = 1/11 and t(c) = 1/12), the bottom values x̄1, x̄2, and the threshold τ = min(x̄1, x̄2).]

21 Correctness of TA
For any monotone t and any database D of objects, TA finds the top k objects.
Proof: if object z was not seen, then for 1 ≤ i ≤ m, zi ≤ x̄i, so t(z1, z2, …, zm) ≤ t(x̄1, x̄2, …, x̄m) = τ.

22 Implementation of TA
Requires only bounded buffers: the top k objects and the bottom m values x̄1, x̄2, …, x̄m.

23 Robustness of TA
Approximation: suppose we want a (1+ε) approximation, i.e. for any R returned and R' not returned, t(R') ≤ (1+ε) · t(R).
Modified stopping condition: stop when k objects have been found with t value at least τ/(1+ε).
Early stopping: TA can be modified so that at any point the user is given the current view of the top k list, together with a guarantee about the ε approximation achieved so far.
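
The modified stopping condition as a small helper, in the notation of the TA sketch above (our code, with eps playing the role of ε):

```python
def approx_stop(top_scores, tau, k, eps):
    """Hypothetical (1+eps)-approximate stopping test for TA:
    the k-th best score so far only needs to reach tau / (1+eps)."""
    return len(top_scores) >= k and min(top_scores) >= tau / (1.0 + eps)
```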

24 Instance Optimality
Intuition: TA cannot stop any sooner, since the next object to be explored might have the threshold value.
But life is a bit more delicate…

25 Wild Guesses
Wild guess: a random access to field i of an object R that has not been sequentially accessed before.
Neither FA nor TA makes wild guesses.
A subsystem might not allow wild guesses.
More exotic queries: the jth position in the ith list…

26 Instance Optimality - No Wild Guesses
Theorem: For any monotone t, let A be the class of algorithms that correctly find the top k answers for every database with aggregation function t and do not make wild guesses, and let D be the class of all databases. Then TA is instance optimal over A and D.
The optimality ratio is m + m² · cR/cS, which is best possible!

27 Proof of Optimality
Claim: if TA reaches iteration d, then any (correct) algorithm A' must reach depth d-1.
Proof sketch: let Rmax be the top object returned by TA; then τ(d) ≤ t(Rmax) ≤ τ(d-1).
If A' stops before depth d-1, there is a database D' containing an unseen object R' = (x̄1(d-1), x̄2(d-1), …, x̄m(d-1)) at level d-1, on which A' fails.

28 Do wild guesses help?
Aggregation function min, k = 1. Database of 2n+1 objects:
x1 = 1 for objects 1, …, n+1 and 0 for the rest; x2 = 1 for objects n+1, …, 2n+1 and 0 for the rest.
L1: 1, 2, …, n, n+1, …, 2n+1.  L2: 2n+1, …, n+1, n, …, 1.
Object n+1 is the only object with min value 1, but it sits at depth n+1 in both lists.
Wild guess: access object n+1 directly, plus the top elements.

29 Strict Monotonicity
An aggregation function t is strictly monotone if whenever xi < x'i for all 1 ≤ i ≤ m, then t(x1, x2, …, xm) < t(x'1, x'2, …, x'm).
Examples: min, max, avg…

30 Instance Optimality - Wild Guesses
Theorem: For any strictly monotone t, let A be the class of algorithms that correctly find the top k answers for every database, and let D be the class of all databases with distinct values in each field. Then TA is instance optimal over A and D.
The optimality ratio is c · m, where c = max{cR/cS, cS/cR}.

31 Related Work
An algorithm similar to TA was discovered independently by two other groups: Nepal and Ramakrishna; Güntzer, Balke and Kiessling.
Neither gave an instance optimality analysis, and hence both proposed modifications that are not instance optimal.
The power of abstraction?

32 Dealing with the Cost of Random Access
In some scenarios random access may be impossible: you cannot ask a major search engine for its internal score on some document.
In some scenarios random access may be expensive: the cost corresponds to disk access (sequential vs. random).
We need algorithms for these scenarios: NRA - No Random Access, and CA - Combined Algorithm.

33 No Random Access - NRA
March down the lists, getting the next object from each.
Maintain, for any object R whose discovered fields are S ⊆ {1, …, m}:
W(R) = t(x1, x2, …, x|S|, 0, …, 0), the worst (smallest) value t(R) can obtain;
B(R) = t(x1, x2, …, x|S|, x̄|S|+1, …, x̄m), the best (largest) value t(R) can obtain.

34 …maintained information (NRA)
Keep a top k list, based on the k largest W(R) seen so far; ties are broken according to B values.
Define Mk to be the kth largest W(R) in the top k list.
An object R is viable if B(R) > Mk.
Stop when there are no viable elements left outside the top list, i.e. B(R) ≤ Mk for all R not in the top list.
Return the top k list.
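
A sketch of NRA over the same hypothetical interface (our code); note that it never calls random_access:

```python
# Sketch of NRA on the hypothetical Middleware interface above.
def nra(mw, t, k):
    known = {}                          # id -> {field index: grade}
    bottom = [1.0] * mw.m               # bottom value seen in each Li
    while True:
        for i in range(mw.m):
            r, x = mw.sorted_access(i)
            known.setdefault(r, {})[i] = x
            bottom[i] = x
        def W(r):                       # worst case: unknown fields are 0
            return t(tuple(known[r].get(i, 0.0) for i in range(mw.m)))
        def B(r):                       # best case: unknown fields = bottom values
            return t(tuple(known[r].get(i, bottom[i]) for i in range(mw.m)))
        if len(known) < k:
            continue
        top = sorted(known, key=lambda r: (W(r), B(r)), reverse=True)[:k]
        Mk = W(top[-1])                 # k-th largest worst-case value
        tau = t(tuple(bottom))          # best value any unseen object can have
        if tau <= Mk and all(B(r) <= Mk for r in known if r not in top):
            return top
```

Recomputing B(R) for every seen object in each round is exactly the cost that slide 37 flags.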

35 Correctness of NRA
For any monotone t and any database D of objects, NRA finds the top k objects.
Proof: at any point, for all objects t(R) ≤ B(R). Once B(R) ≤ Mk for all objects outside the top list, no other object can have t(R) > Mk.

36 Optimality
Theorem: For any monotone t, let A be the class of algorithms that correctly find the top k answers for every database and make only sequential accesses, and let D be the class of all databases. Then NRA is instance optimal over A and D.
The optimality ratio is m.

37 Implementation of NRA
Not so simple: B(R) must be updated for all existing R whenever x̄1, x̄2, …, x̄m change.
For specific aggregation functions (e.g. min) there are good data structures.
Open problem: which aggregation functions have good data structures?

38 Combined Algorithm CA
CA combines TA and NRA. Let h = cR/cS.
Maintain information as in NRA.
After every h sequential accesses, do m random accesses, one on an object from each list: choose the top viable object for which not all fields are known.
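
A sketch of CA in the same style (our code; the choice of which fields to fetch follows the slide's "top viable with unknown fields" rule):

```python
# Sketch of CA on the hypothetical Middleware interface above:
# NRA-style rounds, plus periodic random accesses on the most
# promising partially known object.
def combined_algorithm(mw, t, k):
    h = max(1, round(mw.cR / mw.cS))
    known, bottom, seq = {}, [1.0] * mw.m, 0
    def W(r): return t(tuple(known[r].get(i, 0.0) for i in range(mw.m)))
    def B(r): return t(tuple(known[r].get(i, bottom[i]) for i in range(mw.m)))
    while True:
        for i in range(mw.m):
            r, x = mw.sorted_access(i)
            known.setdefault(r, {})[i] = x
            bottom[i] = x
            seq += 1
        if seq >= h:                    # time to spend random accesses
            seq = 0
            partial = [r for r in known if len(known[r]) < mw.m]
            if partial:
                r = max(partial, key=B)             # top viable candidate
                for i in range(mw.m):
                    if i not in known[r]:
                        known[r][i] = mw.random_access(r, i)
        if len(known) < k:
            continue
        top = sorted(known, key=lambda r: (W(r), B(r)), reverse=True)[:k]
        Mk = W(top[-1])
        tau = t(tuple(bottom))
        if tau <= Mk and all(B(r) <= Mk for r in known if r not in top):
            return top
```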

39 Instance Optimality
The instance optimality statement for CA is a bit more complex.
Under certain assumptions (including t = min or t = sum), CA is instance optimal with optimality ratio ~2m.

40 Further Research
Middleware scenario: better implementations of NRA; is large storage essential?; additional useful information in each list?
How widely applicable is instance optimality? String matching, stable marriage…
Aggregation functions and methods in other scenarios: rank aggregation of search engine results.
P=NP?

41 More Details See

