Aggregation Algorithms and Instance Optimality


Aggregation Algorithms and Instance Optimality
Moni Naor, Weizmann Institute
Joint work with Ron Fagin and Amnon Lotem

Aggregating information from several lists/sources
- Define the problem
- Ways to evaluate algorithms
- New algorithms
- Further research

The problem
- Database D of N objects; an object R has m fields (x1, x2, …, xm), each xi ∈ [0,1]
- The objects are given in m lists L1, L2, …, Lm; list Li holds all objects sorted by xi value
- An aggregation function t(x1, x2, …, xm): a monotone increasing function
- Wanted: the top k objects according to t

Goal
- Touch as few objects as possible
- How do we get access to an object?
[Figure: two sorted lists, with entries such as r1 = 0.5, a1 = 0.4 in L1 and r2 = 0.75, b2 = 0.3, c2 = 0.2 in L2]

Where?
- The problem arises when combining information from several sources/criteria
- Concentrate on middleware complexity, without changing the subsystems

Example: Combining Fuzzy Information
- Lists are the results of a query: "find objects with color 'red' and shape 'round'"
- There are subsystems for color and for shape; each returns a score in [0,1] for each object
- The aggregation function t is how the middleware system should combine the two criteria
- Example: t(R = (x1, x2)) could be min(x1, x2)

Example: Scheduling Pages
- Each object is a page in a data broadcast system
- 1st field: # of users requesting the page
- 2nd field: longest time a user has been waiting
- Aggregation function t: product of the two fields (equivalently, rank by geometric mean)
- Goal: find the page with the largest product

Example: Information Retrieval
- A term-document weight matrix W: terms T1, …, Tn against documents D1, …, Dk, with entries such as W12
- Query T1, T2, T3: find the documents with the largest sum of entries
- Aggregation function t is Σ xi

Modes of Access to the Lists
- Sequential/sorted access: obtain the next object in list Li; cost cS
- Random access: for object R and i ≤ m, obtain xi; cost cR
- Cost of an execution: cS · (# of sequential accesses) + cR · (# of random accesses)

Interesting Cases
- The ratio cR/cS: either cS ≈ cR, or cR >> cS
- The number of lists m is small

Fagin's Algorithm - FA
- For all lists L1, L2, …, Lm, get the next object in sorted order
- Stop when there is a set of k objects that have appeared in all lists
- For every object R encountered, retrieve all fields x1, x2, …, xm and compute t(x1, x2, …, xm)
- Return the top k objects
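As an illustrative sketch (not the paper's pseudocode), FA's two phases can be written in Python. Here `lists` holds object ids in descending order of the corresponding field, and the `fields` dictionary stands in for random access; all names are assumptions of this sketch.

```python
import heapq

def fagins_algorithm(lists, fields, t, k):
    """Sketch of Fagin's Algorithm (FA).

    lists  -- m lists of object ids, each sorted by descending field value
    fields -- fields[obj] = the m-tuple of field values of obj (random access)
    t      -- monotone aggregation function on an m-tuple
    k      -- number of top objects wanted
    """
    m = len(lists)
    seen = [set() for _ in range(m)]   # objects met under sorted access, per list
    encountered = set()
    depth = 0
    # Phase 1: sorted access in parallel until k objects appear in ALL lists.
    while True:
        for i in range(m):
            obj = lists[i][depth]
            seen[i].add(obj)
            encountered.add(obj)
        depth += 1
        if len(set.intersection(*seen)) >= k:
            break
    # Phase 2: random access to fill in every field of every object met,
    # then return the k objects with the largest aggregated grade.
    return heapq.nlargest(k, ((t(fields[obj]), obj) for obj in encountered))
```

Note that phase 2 pays a random access for every object touched in phase 1, which is what TA below avoids.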

Correctness of FA
For any monotone t and any database D of objects, FA finds the top k objects.
Proof: any object not encountered is no better, in every field, than each of the k objects in the intersection, so by monotonicity its grade is no larger.

Performance of FA
- Assuming the fields are independent, the cost is O(N^((m-1)/m)) for constant k
- Better performance under positive correlation between fields; worse under negative correlation
- A bad aggregation function for FA: max

Goals of this work
- Improve the complexity and its analysis: the worst case is not meaningful here; instead consider instance optimality
- Expand the range of functions: want to handle all monotone aggregation functions
- Simplify the implementation

Instance Optimality
A = class of algorithms, D = class of legal inputs. For A ∈ A and D ∈ D, measure cost(A, D) ≥ 0.
An algorithm A ∈ A is instance optimal over A and D if there are constants c1 and c2 such that for every A' ∈ A and D ∈ D:
cost(A, D) ≤ c1 · cost(A', D) + c2
c1 is called the optimality ratio.

…Instance Optimality
- Common in competitive analysis of online algorithms: compare an online decision-making algorithm to the best offline one
- Also in approximation algorithms: compare the size of the solution the best algorithm can find to the one the approximation algorithm finds
- In our case: offline → nondeterminism

…Instance Optimality
We show algorithms that are instance optimal for a variety of:
- Classes of algorithms: deterministic, probabilistic, approximate
- Databases
- Access cost functions

Guidelines for the Design of Algorithms
- Format: do sequential/sorted access (with random access on the other fields) until you know that you have seen the top k
- In general: greedy gathering of information; if a query might allow you to know the top k objects, do it
- Works in all considered scenarios

The Threshold Algorithm - TA
- For all lists L1, L2, …, Lm, get the next object in sorted order
- For each object R returned: retrieve all fields x1, x2, …, xm; compute t(x1, x2, …, xm); if it is one of the top k answers so far, remember it
- For 1 ≤ i ≤ m, let bi be the bottom value seen in Li (so far); define the threshold value τ to be t(b1, b2, …, bm)
- Stop when k objects have been found with t value ≥ τ
- Return the top k objects
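A minimal Python sketch of TA under the same assumed data model as before: `lists` of ids sorted by descending field value, and a `fields` dictionary standing in for random access (names are illustrative, not from the paper).

```python
def threshold_algorithm(lists, fields, t, k):
    """Sketch of the Threshold Algorithm (TA)."""
    m = len(lists)
    bottom = [None] * m      # b_i: bottom value seen so far in list L_i
    top_k = []               # (score, obj) pairs, best first, at most k of them
    scored = set()
    for depth in range(len(lists[0])):
        for i in range(m):
            obj = lists[i][depth]
            bottom[i] = fields[obj][i]
            if obj not in scored:            # random access for the other fields
                scored.add(obj)
                top_k.append((t(fields[obj]), obj))
                top_k.sort(reverse=True)
                del top_k[k:]                # keep only the k best so far
        tau = t(tuple(bottom))               # threshold from the bottom values
        if len(top_k) == k and top_k[-1][0] >= tau:
            break                            # k objects with grade >= tau found
    return top_k
```

Unlike FA, the stopping rule depends only on the threshold τ, so TA never reads deeper than it must and needs only the bounded buffers described below.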

Maintained Information
Example: m = 2, k = 1, t = min. TA maintains the top object so far, the bottom values b1, b2, and the threshold τ = min(b1, b2).
[Figure: two sorted lists with entries c1 = 0.9, b1 = 0.7, r1 = 0.4, a1 = 0.1 and s2 = 3/4, w2 = 2/3, z2 = 1/2, q2 = 1/4; e.g. c = (0.9, 1/12) with t(c) = 1/12, b = (0.7, 1/11) with t(b) = 1/11, r = (0.4, 1/8) with t(r) = 1/8]

Correctness of TA
For any monotone t and any database D of objects, TA finds the top k objects.
Proof: if object z was not seen, then zi ≤ bi for all 1 ≤ i ≤ m, where bi is the bottom value seen in Li, so t(z1, z2, …, zm) ≤ t(b1, b2, …, bm) = τ.

Implementation of TA
Requires only bounded buffers: the top k objects and the bottom m values b1, b2, …, bm.

Robustness of TA
- Approximation: suppose we want a (1+ε)-approximation, i.e. for any R returned and R' not returned, t(R') ≤ (1+ε) · t(R). Modified stopping condition: stop when k objects have been found with t value at least τ/(1+ε)
- Early stopping: TA can be modified so that at any point the user is given the current view of the top k list, together with a guarantee about the ε of approximation

Instance Optimality
Intuition: TA cannot stop any sooner, since the next object to be explored might have the threshold value. But life is a bit more delicate…

Wild Guesses
- A wild guess is a random access to field i of an object R that has not been sequentially accessed before
- Neither FA nor TA uses wild guesses
- A subsystem might not allow wild guesses
- More exotic queries: the jth position in the ith list…

Instance Optimality - No Wild Guesses
Theorem: For any monotone t, let A be the class of algorithms that correctly find the top k answers for every database with aggregation function t and do not make wild guesses, and let D be the class of all databases. Then TA is instance optimal over A and D.
The optimality ratio is m + m² · cR/cS - best possible!

Proof of Optimality
Claim: if TA gets to iteration d, then any correct algorithm A' must get to depth d-1.
Proof: let Rmax be the top object returned by TA; then τ(d) ≤ t(Rmax) ≤ τ(d-1). If A' stops earlier, there exists a database D' with an object R' = (b1(d-1), b2(d-1), …, bm(d-1)) hidden at level d-1, where bi(d-1) is the bottom value of Li at depth d-1, on which A' fails.

Do wild guesses help?
Aggregation function: min, k = 1.
Database: 2n+1 objects. In L1, objects 1, 2, …, n, n+1, …, 2n+1 appear in that order with values 1, 1, …, 1, 1, 0, …, 0; in L2 they appear in reverse order 2n+1, …, n+1, n, …, 1 with values 1, …, 1, 1, 0, …, 0. Object n+1 is the only one with min value 1.
Wild guess: access object n+1 and the top elements of each list; a constant number of accesses suffices, while without wild guesses any algorithm must go to depth about n.

Strict Monotonicity
An aggregation function t is strictly monotone if whenever xi < x'i for all 1 ≤ i ≤ m, then t(x1, x2, …, xm) < t(x'1, x'2, …, x'm).
Examples: min, max, avg…

Instance Optimality - Wild Guesses
Theorem: For any strictly monotone t, let A be the class of algorithms that correctly find the top k answers for every database, and let D be the class of all databases with distinct values in each field. Then TA is instance optimal over A and D.
The optimality ratio is c · m, where c = max{cR/cS, cS/cR}.

Related Work
An algorithm similar to TA was discovered independently by two other groups:
- Nepal and Ramakrishna
- Güntzer, Balke and Kiessling
Neither performed an instance optimality analysis, and hence they proposed modifications that are not instance optimal. The power of abstraction?

Dealing with the Cost of Random Access
- In some scenarios random access may be impossible: you cannot ask a major search engine for its internal score on some document
- In some scenarios random access may be expensive: the cost corresponds to disk access (sequential vs. random)
- Need algorithms for these scenarios: NRA (No Random Access) and CA (Combined Algorithm)

No Random Access - NRA
March down the lists, getting the next object in each. For any object R whose fields in the subset S ⊆ {1, …, m} have been discovered, maintain:
- W(R) = t(x1, x2, …, x|S|, 0, …, 0): the worst (smallest) value t(R) can obtain
- B(R) = t(x1, x2, …, x|S|, b|S|+1, …, bm): the best (largest) value t(R) can obtain, where bi is the bottom value seen so far in Li

…Maintained Information (NRA)
- Keep a top k list, based on the k largest W(R) seen so far; break ties by B values
- Define Mk to be the kth largest W(R) in the top k list
- An object R is viable if B(R) > Mk
- Stop when there are no viable objects left, i.e. B(R) ≤ Mk for all R not in the top list; return the top k list
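For a concrete aggregation function the bookkeeping above becomes implementable. Below is a hedged Python sketch of NRA specialized to t = sum, with field values assumed to lie in [0, 1]; the names `nra_sum`, `W`, `B` are illustrative, not from the paper.

```python
def nra_sum(ranked_lists, k):
    """Sketch of NRA specialized to t = sum, field values in [0, 1].

    ranked_lists -- m lists of (obj, value) pairs, sorted by descending value
    """
    m = len(ranked_lists)
    known = {}               # known[obj] = {list index: discovered value}
    bottom = [1.0] * m       # b_i: bottom value seen so far in list L_i
    top = []
    for depth in range(len(ranked_lists[0])):
        for i in range(m):
            obj, val = ranked_lists[i][depth]
            known.setdefault(obj, {})[i] = val
            bottom[i] = val

        def W(o):            # worst value: unknown fields count as 0
            return sum(known[o].values())

        def B(o):            # best value: unknown field i counts as bottom[i]
            return W(o) + sum(bottom[i] for i in range(m) if i not in known[o])

        top = sorted(known, key=lambda o: (W(o), B(o)), reverse=True)[:k]
        Mk = min(W(o) for o in top) if len(top) == k else float("-inf")
        # Stop when nothing is viable: neither a seen object outside the top
        # list (B(o) <= Mk) nor a completely unseen one (B = sum of bottoms).
        if sum(bottom) <= Mk and all(B(o) <= Mk for o in known if o not in top):
            break
    return [(W(o), o) for o in top]
```

The extra `sum(bottom) <= Mk` check covers objects never seen at all, whose best possible value is the sum of the current bottom values.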

Correctness of NRA
For any monotone t and any database D of objects, NRA finds the top k objects.
Proof: at any point, t(R) ≤ B(R) for all objects R. Once B(R) ≤ Mk for all R outside the top list, no other object can have t(R) > Mk.

Optimality of NRA
Theorem: For any monotone t, let A be the class of algorithms that correctly find the top k answers for every database and make only sequential accesses, and let D be the class of all databases. Then NRA is instance optimal over A and D.
The optimality ratio is m.

Implementation of NRA
- Not so simple: B(R) must be updated for all existing R whenever the bottom values b1, b2, …, bm change
- For specific aggregation functions (e.g. min) there are good data structures
- Open problem: which aggregation functions have good data structures?

Combined Algorithm CA
- Combines TA and NRA; let h = cR/cS
- Maintain information as in NRA
- For every h sequential accesses, do one round of random accesses: choose the top viable object for which not all fields are known and look up its missing fields
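The interleaving can be sketched by extending the NRA sketch, again for t = sum with values in [0, 1] and with the simplifying assumption that h is a positive integer; names and the exact random-access policy are illustrative, not the paper's definition.

```python
def ca_sum(ranked_lists, fields, k, h):
    """Sketch of CA for t = sum; h ~ cR/cS, assumed a positive integer.

    ranked_lists -- m lists of (obj, value) pairs, sorted by descending value
    fields       -- fields[obj] = m-tuple of field values (random access)
    """
    m = len(ranked_lists)
    known = {}               # known[obj] = {list index: discovered value}
    bottom = [1.0] * m       # b_i: bottom value seen so far in list L_i
    top = []
    for depth in range(len(ranked_lists[0])):
        for i in range(m):
            obj, val = ranked_lists[i][depth]
            known.setdefault(obj, {})[i] = val
            bottom[i] = val

        def W(o):            # worst value: unknown fields count as 0
            return sum(known[o].values())

        def B(o):            # best value: unknown field i counts as bottom[i]
            return W(o) + sum(bottom[i] for i in range(m) if i not in known[o])

        if (depth + 1) % h == 0:
            # Random-access step: complete the best incomplete object.
            partial = [o for o in known if len(known[o]) < m]
            if partial:
                target = max(partial, key=B)
                for i in range(m):
                    known[target][i] = fields[target][i]

        top = sorted(known, key=lambda o: (W(o), B(o)), reverse=True)[:k]
        Mk = min(W(o) for o in top) if len(top) == k else float("-inf")
        if sum(bottom) <= Mk and all(B(o) <= Mk for o in known if o not in top):
            break
    return [(W(o), o) for o in top]
```

With h = 1 the random accesses are frequent, as when cR ≈ cS; a large h makes the behavior approach the NRA sketch.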

Instance Optimality of CA
The instance optimality statement is a bit more complex: under certain assumptions (including t = min or sum), CA is instance optimal with optimality ratio ≈ 2m.

Further Research
- Middleware scenario: better implementations of NRA; is large storage essential? Is there additional useful information in each list?
- How widely applicable is instance optimality? String matching, stable marriage…
- Aggregation functions and methods in other scenarios: rank aggregation of search engines
- P = NP?

More Details See www.wisdom.weizmann.ac.il/~naor/PAPERS/middle_agg.html