Optimal Aggregation Algorithms for Middleware. Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)

1 Optimal Aggregation Algorithms for Middleware
Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)
All rights reserved by Xuehua Shen xshen@uiuc.edu

2 Problem: Rank Aggregation
- Each object is scored on m different criteria, with one sorted list per criterion
- The combined score is calculated by an aggregation function
- Problem: find the top-k objects with the highest combined scores

3 Example: Rank Aggregation
Mileage          Year             Price
carID  Score     carID  Score     carID  Score
c      1.0       a      0.9       d      1.0
a      0.8       b      0.7       e      0.9
e      0.6       c                b      0.8
b      0.5       d                c      0.7
d                e      0.5       a      0.6
Aggregation function, e.g. weighted sum:
Combined score = 0.2 * mileage score + 0.3 * year score + 0.5 * price score
Top 2 cars:
carID  score
d      0.81
c      0.76
Do we need to access all entries of all sorted lists?
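
As a minimal sketch of the weighted sum above, the following Python computes combined scores for cars a, b and e only, since those are the cars whose three scores all appear in this transcript. The names `WEIGHTS`, `SCORES` and `combined_score` are illustrative and not from the slides.

```python
# Weights from the slide: 0.2 * mileage + 0.3 * year + 0.5 * price.
WEIGHTS = {"mileage": 0.2, "year": 0.3, "price": 0.5}

# Only cars whose three scores are all present in the slide's tables.
SCORES = {
    "a": {"mileage": 0.8, "year": 0.9, "price": 0.6},
    "b": {"mileage": 0.5, "year": 0.7, "price": 0.8},
    "e": {"mileage": 0.6, "year": 0.5, "price": 0.9},
}


def combined_score(car_id):
    """Weighted sum of the per-criterion scores of one car."""
    scores = SCORES[car_id]
    return sum(WEIGHTS[criterion] * scores[criterion] for criterion in WEIGHTS)


if __name__ == "__main__":
    for car_id in sorted(SCORES):
        print(car_id, round(combined_score(car_id), 2))
```

All three values come out below the 0.76 and 0.81 reported for c and d, which agrees with the slide's top 2.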

4 Applications
- Multimedia database systems
- Web search
[Figure: a query (Color='red' and Shape='round') is sent to a rank aggregation engine, which reads one sorted list for Color='red' and one for Shape='round' and returns the top k. From Zhang2002 talk]

5 Outline
- Assumptions
- Fagin Algorithm
- Threshold Algorithm
- Summary & Comments

6 Assumption 1: Modes of Access
- Sequential access: obtain the score of the next object in a sorted list, proceeding from the current position
- Random access: obtain the score of a given object in a sorted list with one access
- Assumption: both access modes are available
carID  Year score
a      0.8
c
e      0.7
…

7 Assumption 2: Aggregation Function
- Each object gets scores in the interval [0,1] from the different subsystems
- An aggregation function t combines them into a combined score, e.g. min, avg
- Monotone: t(x1, ..., xm) <= t(x1', ..., xm') if xi <= xi' for every i
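
The monotonicity condition can be spot-checked numerically. The sketch below tests the condition on a small grid of score vectors, which can refute monotonicity but does not prove it. The helper `is_monotone_on_grid` and the grid values are assumptions made for illustration.

```python
from itertools import product


def avg(scores):
    """Average of a sequence of scores."""
    return sum(scores) / len(scores)


def is_monotone_on_grid(t, m, grid=(0.0, 0.5, 1.0)):
    """Return False if some pair x <= y (componentwise) has t(x) > t(y)."""
    points = list(product(grid, repeat=m))
    for x in points:
        for y in points:
            dominated = all(xi <= yi for xi, yi in zip(x, y))
            if dominated and t(x) > t(y):
                return False
    return True


if __name__ == "__main__":
    print(is_monotone_on_grid(min, 2))
    print(is_monotone_on_grid(avg, 3))
    # Not monotone: raising the second score lowers the result.
    print(is_monotone_on_grid(lambda x: x[0] * (1 - x[1]), 2))
```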

8 Intuition of Algorithms
- The top objects in the individual sorted lists are likely candidates for the correct answer
- Do some accesses, then ask: "Can we stop now?"

9 Fagin Algorithm
Price            Mileage          Year
carID  score     carID  score     carID  score
a      0.9       b      1.0       a      0.8
c      0.8       e      0.8       c
e      0.7       f      0.7       e      0.7
…                …                …
- 'e' appears in all of them, so the top-1 object must be in {a, b, c, e, f}
- Why? The function is monotone, so object 'e' blocks all objects below it
- We can't say 'e' must be top-1: other objects can still have a higher combined score
- Do random access for these 5 objects to get their scores and pick the top-1

10 Drawbacks of Fagin Algorithm
- Uses only the information provided by the sorted lists and the monotone property
- Has to remember many objects: large buffer size

11 Threshold Algorithm (TA)
- Intuition: the combined score calculated by the aggregation function provides extra information, namely an upper bound (or threshold) on the combined score of unseen objects
- When object R is seen under sequential access, immediately do random access to get all other scores of R and compute its combined score
- At the same time, keep track of the upper bound of the unseen objects
- Halt when at least k objects have combined scores no less than the upper bound

12 TA: Example (K=1, AVG aggregation)
Price            Mileage          Year
carID  score     carID  score     carID  score
a      0.9       b      1.0       a      0.8
c      0.8       e      0.8       c
e      0.7       f      0.7       e      0.7
…                …                …
- Step 1: sequential access gives 'a' with price score 0.9; random access gives the mileage score (0.6) and year score (0.8) of 'a'; the avg is 0.77
  Upper bound: 0.9, buffer holds 0.77
- Step 2: sequential access gives 'b' with mileage score 1.0; random access gives the price score (0.7) and year score (0.7) of 'b'; the avg is 0.8
  Upper bound: 0.8, buffer holds 0.8
- Constant-size buffer
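
The TA steps can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: it assumes every object appears in every list, models random access with a dictionary per list, and checks the halting condition once per row. The data in `LISTS` is hypothetical and differs from the partially shown tables on the slide.

```python
import heapq


def avg(scores):
    return sum(scores) / len(scores)


def threshold_algorithm(lists, t, k):
    """Return the top-k (score, object) pairs, best first.

    lists: m lists of (object, score) pairs, each sorted by score descending.
    t: monotone aggregation function over a list of m scores.
    """
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    heap = []        # min-heap holding at most k (combined score, object)
    members = set()  # objects currently in the heap
    for depth in range(len(lists[0])):
        last_seen = []
        for i in range(m):
            obj, score = lists[i][depth]  # sequential access
            last_seen.append(score)
            if obj in members:
                continue
            total = t([lookup[j][obj] for j in range(m)])
            if len(heap) < k:
                heapq.heappush(heap, (total, obj))
                members.add(obj)
            elif total > heap[0][0]:
                _, evicted = heapq.heapreplace(heap, (total, obj))
                members.discard(evicted)
                members.add(obj)
        # Upper bound on the combined score of any object not yet seen.
        threshold = t(last_seen)
        if len(heap) == k and heap[0][0] >= threshold:
            break
    return sorted(heap, reverse=True)


# Hypothetical data: price, mileage and year lists for five cars.
LISTS = [
    [("a", 0.9), ("c", 0.8), ("e", 0.7), ("b", 0.6), ("d", 0.2)],
    [("b", 1.0), ("e", 0.8), ("a", 0.6), ("c", 0.3), ("d", 0.1)],
    [("a", 0.8), ("c", 0.75), ("e", 0.7), ("b", 0.5), ("d", 0.4)],
]

if __name__ == "__main__":
    print(threshold_algorithm(LISTS, avg, 1))
```

On this data the loop stops after three rows, once the threshold has dropped below the best buffered score, and the buffer never holds more than k entries.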

13 Evaluation of TA
- TA never stops later than FA
- TA requires only a small, constant-size (K) buffer
- However, TA may perform more random accesses

14 Summary
- FA and TA, with both sequential access and random access
- Extend TA to other situations
  - Approximate algorithm
  - No random access

15 Comments
- The algorithms rely on universal identification of objects across the different lists
- The assumptions are not always valid, e.g. not every sorted list exists beforehand
- Do sequential access wisely to speed up TA on skewed data

17 Backup Slides

18 Middleware
- Middleware functions as a translation layer: it handles all incoming requests (such as a top-k query) and replies, interacting with the disparate back-office systems to gather the information it needs
- Application developers do not need to know that there are several heterogeneous systems behind the middleware

19 Boolean Query vs. Fuzzy Query
- Semantics
  - Get all the results that satisfy the conditions vs. get the best possible answers to the query
  - Size of result: constant vs. variable
- Processing the query
  - For a Boolean query, whether a tuple belongs to the result can be determined from the tuple itself, so each tuple can be handled individually
  - For a fuzzy query, whether a tuple is in the result cannot be determined from the tuple alone

20 Fuzzy Query Processor (from Zhang02)
[Figure: a Boolean query processor takes the query Title='database' and Price <100 against a traditional database and returns a set. A fuzzy query processor takes the query Color='red' and Shape='round' against a database with fuzzy data, reads one sorted list for Color='red' and one for Shape='round', and returns the top k.]

21 Cost
- Reduce the number of sequential accesses (Cs)
- The number of random accesses is at most m-1 times the number of sequential accesses
- The overall cost is bounded by Cs times a constant factor
- Really optimal?

22 Approximation Algorithm
- Approximate top-k answers are acceptable or even desirable
- θ-approximation (θ>1)
  - For any object y in the answer and any object z in the database, θt(y) >= t(z)
- Turning TA into an approximation algorithm
  - Halt when the top k objects seen so far satisfy the inequality
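
One way to read the modified halting rule is as a single predicate: TA stops as soon as θ times the worst buffered score reaches the threshold on unseen objects. The function name `can_halt` and its parameters are assumptions for this sketch; with `theta=1.0` it reduces to the exact TA condition.

```python
def can_halt(top_scores, threshold, k, theta=1.0):
    """Halting test for TA (theta == 1) or its theta-approximation (theta > 1).

    top_scores: combined scores of the best objects seen so far (at most k).
    threshold: upper bound on the combined score of any unseen object.
    """
    if len(top_scores) < k:
        return False
    return theta * min(top_scores) >= threshold


if __name__ == "__main__":
    # Exact TA would keep going here, the 1.2-approximation may stop.
    print(can_halt([0.77], 0.9, k=1))
    print(can_halt([0.77], 0.9, k=1, theta=1.2))
```

A larger θ lets the algorithm stop earlier at the price of a weaker guarantee on the returned objects.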

23 No Random Access (NRA)
- Similar to TA, except that
  - No exact scores
  - No sorted order
  - Tracks a lower bound and an upper bound on the score of each object seen
- Do sequential access until there are k objects whose lower bound is no less than the upper bound of all other objects

24 NRA cont.
- Lower bound: use 0 for each score not yet seen
- Upper bound: use the last score seen in the list for each score not yet seen
Price            Mileage          Year
carID  score     carID  score     carID  score
a      0.9       b      1.0       a      0.8
c      0.8       e      0.8       c
e      0.7       f      0.7       e      0.7
…                …                …
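
The two bounds can be computed with a few lines of Python. The sketch below covers only the bound computation, not the full NRA loop. The name `nra_bounds`, its parameters, and the choice of avg are illustrative assumptions, and the example uses the first row of each list above.

```python
def avg(scores):
    return sum(scores) / len(scores)


def nra_bounds(known, last_seen, t):
    """Lower and upper bound on an object's combined score.

    known: dict mapping list index -> score already seen for this object.
    last_seen: last score read from each list under sequential access.
    t: monotone aggregation function.
    """
    m = len(last_seen)
    lower = t([known.get(i, 0.0) for i in range(m)])
    upper = t([known.get(i, last_seen[i]) for i in range(m)])
    return lower, upper


if __name__ == "__main__":
    # After one row of each list: price a 0.9, mileage b 1.0, year a 0.8.
    last_seen = [0.9, 1.0, 0.8]
    print(nra_bounds({0: 0.9, 2: 0.8}, last_seen, avg))  # bounds for 'a'
    print(nra_bounds({1: 1.0}, last_seen, avg))          # bounds for 'b'
```

Because t is monotone, substituting 0 can only lower the result and substituting the last score seen can only raise it, so the true combined score lies between the two values.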

25 NRA example
Advantage: R1(1,0), others(1/3,1/3) Top 1 Top 2 vs. Top 1: R1(1,0), R2(1,1/4), others(1/3,1/3) Top 2
Lots of bookkeeping

26 Optimality of FA
- Assumption
  - t is monotone
- Cost
  - Θ(N^((m-1)/m) k^(1/m)) with arbitrarily high probability
- Optimality
  - Every algorithm that correctly finds the top k answers for a strict monotone query F_t(A_1, A_2, …, A_m), where A_1, A_2, …, A_m are independent, and that makes no wild guesses has cost Θ(N^((m-1)/m) k^(1/m)) with arbitrarily high probability
  - FA is optimal among all such algorithms in the high-probability sense

27 Optimality of TA
- Assumption
  - t is monotone
- Instance optimality
  - For any algorithm C that correctly finds the top k answers for a monotone query F_t(A_1, A_2, …, A_m) without wild guesses, and for any database D: Cost(TA, D) = O(Cost(C, D))
  - TA is instance optimal among all such algorithms

28 Optimality of NRA
- Assumption
  - t is monotone
- Instance optimality
  - NRA is instance optimal among all algorithms that correctly find the top k objects for a monotone query t on every database and make no random accesses

29 Algorithm Comparison (from Zhang2002 talk)
Algorithm  Assumption  Access Model    Termination (Worst Case)  Termination (Expected)   Buffer Space
FA         Monotone    Sorted, Random  n(m-1)/m + k/m            N^((m-1)/m) k^(1/m)      N
TA         Monotone    Sorted, Random  Bounded by FA             Depends on distribution  k
NRA        Monotone    Sorted          N                         Depends on distribution  N

30 Worst Case
O_1: 1.0, 0.0
O_2: 1.0, 0.0
...
O_{n+1}: 1.0
O_{n+2}: 0.0, 1.0
O_{n+3}: 0.0, 1.0
...
O_{2n+1}: 0.0, 1.0
Aggregation function: min
n(m-1)/m + k/m

31 Naïve algorithm
- Algorithm:
  - For each criterion, do sequential access to retrieve all objects and their scores
  - Calculate the combined scores of all objects
  - Pick the top K
- Comments:
  - Accesses the entire database
  - Cost is linear in the database size
  - Does NOT use the fact that each list is sorted
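
The naive baseline can be sketched in a few lines. The function name `naive_top_k` and the data in `LISTS` are hypothetical, chosen only to make the example runnable.

```python
import heapq


def avg(scores):
    return sum(scores) / len(scores)


def naive_top_k(lists, t, k):
    """Read every entry of every list, score every object, keep the best k."""
    m = len(lists)
    scores = {}  # object -> {list index: score}
    for i, lst in enumerate(lists):
        for obj, score in lst:  # full sequential scan of list i
            scores.setdefault(obj, {})[i] = score
    combined = ((t([s[i] for i in range(m)]), obj) for obj, s in scores.items())
    return heapq.nlargest(k, combined)


# Hypothetical data: price, mileage and year lists for five cars.
LISTS = [
    [("a", 0.9), ("c", 0.8), ("e", 0.7), ("b", 0.6), ("d", 0.2)],
    [("b", 1.0), ("e", 0.8), ("a", 0.6), ("c", 0.3), ("d", 0.1)],
    [("a", 0.8), ("c", 0.75), ("e", 0.7), ("b", 0.5), ("d", 0.4)],
]

if __name__ == "__main__":
    print(naive_top_k(LISTS, avg, 2))
```

The scan never looks at the order of the entries, which is the inefficiency that FA and TA address.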

32 Fagin Algorithm
- Algorithm:
  - Do sequential access in parallel to all sorted lists Li until there are k "matches". A "match" is an object that has been seen in all sorted lists Li
  - Then, for each object that has been seen, do random access to get all of its scores
  - Compute the combined scores and pick the top k
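
The two phases of FA can be sketched as below. This is an illustrative sketch under the assumption that every object appears in every list; random access is modelled by a dictionary per list, and the data in `LISTS` is hypothetical.

```python
import heapq


def avg(scores):
    return sum(scores) / len(scores)


def fagin_top_k(lists, t, k):
    """Return the top-k (score, object) pairs, best first."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    seen_in = {}  # object -> set of list indices where it has been seen
    matches = 0
    depth = 0
    # Phase 1: parallel sequential access until k objects are seen in all lists.
    while matches < k and depth < len(lists[0]):
        for i in range(m):
            obj, _ = lists[i][depth]
            where = seen_in.setdefault(obj, set())
            where.add(i)
            if len(where) == m:
                matches += 1
        depth += 1
    # Phase 2: random access for every object seen, then pick the best k.
    scored = [(t([lookup[i][obj] for i in range(m)]), obj) for obj in seen_in]
    return heapq.nlargest(k, scored)


# Hypothetical data: price, mileage and year lists for five cars.
LISTS = [
    [("a", 0.9), ("c", 0.8), ("e", 0.7), ("b", 0.6), ("d", 0.2)],
    [("b", 1.0), ("e", 0.8), ("a", 0.6), ("c", 0.3), ("d", 0.1)],
    [("a", 0.8), ("c", 0.75), ("e", 0.7), ("b", 0.5), ("d", 0.4)],
]

if __name__ == "__main__":
    print(fagin_top_k(LISTS, avg, 1))
```

Note that `seen_in` grows with every new object read in phase 1, which is the buffer cost that slide 10 lists as a drawback.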

