All right reserved by Xuehua Shen 1 Optimal Aggregation Algorithms for Middleware Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)

Presentation transcript:

Slide 2: Problem: Rank Aggregation
- Each object is scored on m different criteria; there is one sorted list per criterion.
- The combined score is calculated by an aggregation function.
- Problem: find the top-k objects with the highest combined scores.

Slide 3: Example

carID  Mileage score     carID  Year score     carID  Price score
c      1.0               a      0.9            d      1.0
a      0.8               b      0.7            e      0.9
e      0.6               c      (missing)      b      0.8
b      0.5               d      (missing)      c      0.7
d      (missing)         e      0.5            a      0.6

Rank aggregation, e.g. weighted sum:
Combined score = 0.2 * mileage score + 0.3 * year score + [weight missing] * price score

Top 2 cars:
carID  Score
d      0.81
c      0.76

Do we need to access all entries of all sorted lists?

Slide 4: Applications
- Multimedia database systems
- Web search queries

[Figure, from Zhang2002 talk: the query Color='red' and Shape='round' is sent to a rank aggregation engine, which combines the sorted lists for Color='red' and Shape='round' from the color and shape subsystems and returns the top k.]

Slide 5: Outline
- Assumptions
- Fagin Algorithm
- Threshold Algorithm
- Summary & Comments

Slide 6: Assumption 1: Modes of Access
- Sequential access: obtain the score of the next object in a sorted list, proceeding sequentially from the current position.
- Random access: obtain the score of a given object in a sorted list with one random access.
- Assumption: both access modes are available.

carID  Year score
a      0.8
c      (missing)
e      0.7
…

Slide 7: Assumption 2: Aggregation Function
- Each object gets a score in the interval [0,1] from each subsystem.
- An aggregation function t combines these scores into one combined score, e.g. min, avg.
- Monotone: t(x1, ..., xm) <= t(x1', ..., xm') if xi <= xi' for every i.
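The monotonicity assumption can be checked by brute force on a small grid of scores. The following is a minimal sketch, not part of the original slides; the helper name `is_monotone_on_grid` and the grid values are assumptions chosen for illustration.

```python
from itertools import product
from statistics import mean


def is_monotone_on_grid(t, m, grid):
    """Check t(x) <= t(y) whenever x[i] <= y[i] for every i, over all grid points.

    This is only a finite spot check on the given grid, not a proof.
    """
    points = list(product(grid, repeat=m))
    for x in points:
        for y in points:
            dominated = all(xi <= yi for xi, yi in zip(x, y))
            if dominated and t(x) > t(y):
                return False
    return True


# Hypothetical grid of scores in [0, 1].
grid = [0.0, 0.25, 0.5, 0.75, 1.0]

min_ok = is_monotone_on_grid(min, 2, grid)    # min is monotone
avg_ok = is_monotone_on_grid(mean, 2, grid)   # avg is monotone
# A difference of scores is not monotone: raising the second score lowers the result.
diff_ok = is_monotone_on_grid(lambda s: s[0] - s[1], 2, grid)
print(min_ok, avg_ok, diff_ok)
```

Both min and avg pass the check, while the score difference fails it, which is why such a function falls outside the assumption.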

Slide 8: Intuition of Algorithms
- Objects near the top of the individual sorted lists are good candidates for the correct answer.
- Do some accesses, then ask: "Can we stop now?"

Slide 9: Fagin Algorithm

carID  Price score     carID  Mileage score     carID  Year score
a      0.9             b      1.0               a      0.8
c      0.8             e      0.8               c      (missing)
e      0.7             f      0.7               e      0.7
…                      …                        …

- 'e' appears in all three lists.
- The top-1 object must be in {a, b, c, e, f}. Why? The aggregation function is monotone, so 'e' blocks every object that lies below the current position in all lists.
- Do random access for these 5 objects to get their remaining scores, then pick the top 1.
- We cannot say that 'e' must be the top 1: the other objects seen can still have a higher combined score.

Slide 10: Drawbacks of Fagin Algorithm
- It only uses the information provided by the sorted lists and the monotone property.
- It has to remember many objects: large buffer size.

Slide 11: Threshold Algorithm (TA)
- Intuition: the aggregation function can provide extra information, namely an upper bound (threshold) on the combined score of unseen objects.
- When object R is seen under sequential access, immediately do random access to get all other scores of R and compute its combined score.
- At the same time, keep track of the upper bound for the unseen objects.
- Halt when at least k objects have combined scores no less than the upper bound.

Slide 12: TA Example (k=1, AVG aggregation)

carID  Price score     carID  Mileage score     carID  Year score
a      0.9             b      1.0               a      0.8
c      0.8             e      0.8               c      (missing)
e      0.7             f      0.7               e      0.7
…                      …                        …

- Step 1: sequential access returns 'a' with price score 0.9; random access then returns the mileage score (0.6) and year score (0.8) of 'a'; avg is 0.77. Upper bound: [value missing]
- Step 2: sequential access returns 'b' with mileage score 1.0; random access then returns the price score (0.7) and year score (0.7) of 'b'; avg is 0.8. Upper bound: [value missing]
- Constant-size buffer
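The TA procedure described in this deck can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: every object is assumed to appear in every list, random access is simulated with dictionaries, and the halting test is run after each full row of sequential accesses. The scores mirror the car example where the slides give them; entries the slides do not show (the year score of 'c', and the scores of 'f', 'c' and 'a' deep in the lists) are hypothetical fillers.

```python
import heapq


def avg(scores):
    return sum(scores) / len(scores)


def threshold_algorithm(lists, t, k):
    """lists: one list per criterion of (object, score) pairs, sorted by descending score."""
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    top = {}                               # current best objects -> combined score
    for depth in range(len(lists[0])):
        for lst in lists:
            obj, _ = lst[depth]            # sequential access
            if obj not in top:
                # random access to every list for this object
                top[obj] = t([scores[obj] for scores in lookup])
        # keep only the k best objects seen so far (bounded buffer)
        top = dict(heapq.nlargest(k, top.items(), key=lambda kv: kv[1]))
        # upper bound on the combined score of any unseen object
        threshold = t([lst[depth][1] for lst in lists])
        if len(top) >= k and min(top.values()) >= threshold:
            break
    ranked = sorted(top.items(), key=lambda kv: kv[1], reverse=True)
    return ranked, depth + 1


# Scores partly from the slides, partly hypothetical fillers (marked).
price = [("a", 0.9), ("c", 0.8), ("e", 0.7), ("b", 0.7), ("f", 0.6)]     # f: hypothetical
mileage = [("b", 1.0), ("e", 0.8), ("f", 0.7), ("a", 0.6), ("c", 0.5)]   # c: hypothetical
year = [("a", 0.8), ("c", 0.75), ("e", 0.7), ("b", 0.7), ("f", 0.3)]     # c, f: hypothetical

result, rows_read = threshold_algorithm([price, mileage, year], avg, k=1)
print(result, rows_read)
```

With these values the threshold after the first row is avg(0.9, 1.0, 0.8) = 0.9, which 'b' (0.8) does not reach; after the second row it drops below 0.8, so the sketch halts having read two rows.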

Slide 13: Evaluation of TA
- TA never stops later than FA.
- TA requires only a small, constant-size buffer (k objects).
- However, TA may perform more random accesses.

Slide 14: Summary
- FA and TA, with both sequential access and random access
- Extending TA to other situations:
  - Approximate algorithm
  - No random access

Slide 15: Comments
- The algorithms rely on universal identification of objects across the different lists.
- The assumptions may not always hold, e.g. not every sorted list exists beforehand.
- Do sequential access wisely to speed up TA on skewed data.

Slide 17: Backup Slides

Slide 18: Middleware
- Middleware functions as a translation layer: it handles all incoming requests (such as a top-k query) and replies, interacting with the disparate back-office systems to gather the information it needs.
- Application developers do not need to know that there are several heterogeneous systems behind the middleware.

Slide 19: Boolean Query vs. Fuzzy Query
- Semantics
  - Get all the results that satisfy the conditions vs. get the best possible answers to the query
  - Size of result: constant vs. variable
- Processing the query
  - For a Boolean query, whether a tuple belongs to the result can be determined from the tuple itself, so each tuple can be handled individually.
  - For a fuzzy query, whether a tuple is in the result cannot be determined from the tuple alone.

Slide 20: Fuzzy Query Processor (from Zhang02)

[Figure: a Boolean query processor takes the query Title='database' and Price<100 against a traditional database and returns a set. A fuzzy query processor takes the query Color='red' and Shape='round' against a database with fuzzy data, combines the sorted lists for Color='red' and Shape='round' from the color and shape subsystems, and returns the top k.]

Slide 21: Cost
- Goal: reduce the number of sequential accesses (Cs).
- The number of random accesses is bounded by the number of sequential accesses times a factor of m-1.
- The overall cost is therefore bounded by Cs times a constant factor.
- Really optimal?

Slide 22: Approximation Algorithm
- Approximate top-k answers are acceptable or even desirable.
- θ-approximation (θ > 1):
  - For any object y in the answer and any object z not in the answer, θ t(y) >= t(z).
- Turning TA into an approximation algorithm:
  - Halt once the top k objects seen so far satisfy the inequality.
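The θ-approximation condition and a relaxed stopping test can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and the stopping test (k seen objects scoring at least threshold / θ) is one reading of how the inequality is applied against TA's upper bound. The combined scores are made-up values.

```python
def is_theta_approximation(answer, scores, theta):
    """Check theta * t(y) >= t(z) for every y in the answer and every z outside it.

    answer: set of object ids; scores: dict mapping every object to its combined score.
    """
    outside = [s for obj, s in scores.items() if obj not in answer]
    return all(theta * scores[y] >= z for y in answer for z in outside)


def can_halt(top_k_scores, k, threshold, theta):
    """Relaxed TA stopping test (assumed form): k seen objects score >= threshold / theta."""
    return len(top_k_scores) >= k and min(top_k_scores) >= threshold / theta


# Hypothetical combined scores.
scores = {"a": 0.77, "b": 0.80, "c": 0.68}

loose = is_theta_approximation({"a"}, scores, theta=1.1)   # 1.1 * 0.77 >= 0.80
exact = is_theta_approximation({"a"}, scores, theta=1.0)   # 0.77 < 0.80
print(loose, exact)
print(can_halt([0.77], 1, threshold=0.8, theta=1.1))
print(can_halt([0.77], 1, threshold=0.9, theta=1.1))
```

With θ = 1.1, returning 'a' instead of the true best 'b' is acceptable, which is what lets the relaxed test stop earlier than exact TA.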

Slide 23: Non Random Access (NRA)
- Similar to TA, except that no random access is made, so:
  - exact scores may not be known
  - the answer is not necessarily in sorted order
  - a lower bound and an upper bound are kept for each such object
- Do sequential access until there are k objects whose lower bounds are no less than the upper bounds of all other objects.

Slide 24: NRA cont.
- Lower bound: use 0 for each score not yet seen.
- Upper bound: use the last score seen in that list for each score not yet seen.

carID  Price score     carID  Mileage score     carID  Year score
a      0.9             b      1.0               a      0.8
c      0.8             e      0.8               c      (missing)
e      0.7             f      0.7               e      0.7
…                      …                        …
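The bound computation can be sketched as below, using only the first row of the car lists (price 'a' 0.9, mileage 'b' 1.0, year 'a' 0.8). The function name `nra_bounds` and the use of `None` for an unseen score are assumptions for illustration; avg is assumed as the aggregation function.

```python
def avg(scores):
    return sum(scores) / len(scores)


def nra_bounds(seen_scores, last_seen, t):
    """Lower and upper bound on an object's combined score under sequential access only.

    seen_scores: one entry per list, a score or None if not yet seen in that list.
    last_seen:   the last score read sequentially from each list.
    """
    lower = t([s if s is not None else 0.0 for s in seen_scores])
    upper = t([s if s is not None else last
               for s, last in zip(seen_scores, last_seen)])
    return lower, upper


# After one row of sequential access (order: price, mileage, year).
last_seen = [0.9, 1.0, 0.8]

lower_a, upper_a = nra_bounds([0.9, None, 0.8], last_seen, avg)   # 'a': mileage unseen
lower_b, upper_b = nra_bounds([None, 1.0, None], last_seen, avg)  # 'b': price, year unseen
print(lower_a, upper_a)
print(lower_b, upper_b)
```

At this point neither lower bound reaches the other object's upper bound, so the sketch would keep reading; the bounds tighten as more rows are seen.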

Slide 25: NRA example
Advantage: R1(1,0), others(1/3,1/3) Top 1 Top 2 vs. Top 1: R1(1,0), R2(1,1/4), others(1/3,1/3) Top 2
Lots of bookkeeping

Slide 26: Optimality of FA
- Assumption: t is monotone.
- Cost: $\Theta(N^{(m-1)/m} k^{1/m})$ with arbitrarily high probability.
- Optimality:
  - Every algorithm that correctly finds the top k answers for a strict monotone query $F_t(A_1, A_2, \ldots, A_m)$, where $A_1, A_2, \ldots, A_m$ are independent, and that makes no wild guesses has cost $\Theta(N^{(m-1)/m} k^{1/m})$ with arbitrarily high probability.
  - FA is optimal among all such algorithms in the high-probability sense.
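To get a feel for the cost expression, it can be evaluated for concrete sizes. The numbers below (a million objects, two lists, top 10) are hypothetical, and the result ignores the constant factors hidden in the Θ notation.

```python
def fa_cost_estimate(n, m, k):
    """Evaluate N^((m-1)/m) * k^(1/m), ignoring constant factors."""
    return n ** ((m - 1) / m) * k ** (1 / m)


# Hypothetical sizes: N = 1,000,000 objects, m = 2 lists, top k = 10.
estimate = fa_cost_estimate(1_000_000, 2, 10)
print(round(estimate))
```

For these sizes the expression is in the low thousands, far below the million entries a full scan of each list would touch.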

Slide 27: Optimality of TA
- Assumption: t is monotone.
- Instance optimality:
  - For any algorithm C that correctly finds the top k answers for a monotone query $F_t(A_1, A_2, \ldots, A_m)$ without wild guesses, and for any database D: Cost(TA, D) = O(Cost(C, D)).
  - TA is instance optimal among all such algorithms.

Slide 28: Optimality of NRA
- Assumption: t is monotone.
- Instance optimality:
  - NRA is instance optimal among all algorithms that correctly find the top k objects for a monotone query t on every database and do not make random accesses.

Slide 29: Algorithm Comparison (from Zhang2002 talk)

Algorithm  Assumption  Access model    Termination (worst case)  Termination (expected)      Buffer space
FA         Monotone    Sorted, random  n(m-1)/m + k/m            $N^{(m-1)/m} k^{1/m}$       N
TA         Monotone    Sorted, random  Bounded by FA             Depends on distribution     k
NRA        Monotone    Sorted          N                         Depends on distribution     N

Slide 30: Worst Case
[Figure: sorted lists over the objects O_1, O_2, ..., O_n, ..., O_2n]
Aggregation function: min
n(m-1)/m + k/m

Slide 31: Naïve algorithm
- Algorithm:
  - For each criterion, do sequential access to retrieve all objects and their scores.
  - Calculate the combined scores for all objects.
  - Pick the top k.
- Comments:
  - Accesses the entire database.
  - Cost is linear in the database size.
  - Does NOT use the fact that each list is sorted.
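The naïve algorithm above can be sketched in a few lines. The data and the function name `naive_top_k` are hypothetical; the sketch assumes every object appears in every list and uses min as the aggregation function.

```python
import heapq
from collections import defaultdict


def naive_top_k(lists, t, k):
    """Scan every list completely, combine all scores, then pick the k best."""
    per_object = defaultdict(list)
    for lst in lists:
        for obj, score in lst:          # sequential access over the whole list
            per_object[obj].append(score)
    combined = {obj: t(scores) for obj, scores in per_object.items()}
    return heapq.nlargest(k, combined.items(), key=lambda kv: kv[1])


# Hypothetical data: two criteria, three objects.
list_1 = [("x", 0.9), ("y", 0.8), ("z", 0.3)]
list_2 = [("y", 0.7), ("z", 0.6), ("x", 0.2)]

top_2 = naive_top_k([list_1, list_2], min, k=2)
print(top_2)
```

Every entry of every list is read regardless of k, which is the linear cost noted in the comments above.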

Slide 32: Fagin Algorithm
- Algorithm:
  - Do sequential access in parallel to all sorted lists Li until there are k "matches". A "match" is an object that has been seen in all sorted lists Li.
  - Then, for each object that has been seen, do random access to get all of its scores.
  - Compute the combined scores and pick the top k.
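The two phases above can be sketched as follows. This is a minimal illustration, not the authors' implementation: random access is simulated with dictionaries, every object is assumed to appear in every list, and avg is assumed as the aggregation function. The scores follow the car example where the slides give them; the remaining entries are hypothetical fillers, marked in the comments.

```python
def avg(scores):
    return sum(scores) / len(scores)


def fagin_algorithm(lists, t, k):
    """lists: one list per criterion of (object, score) pairs, sorted by descending score."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    seen_in = {}                           # object -> indices of lists where it was seen

    # Phase 1: sequential access in parallel until k objects were seen in all lists.
    for depth in range(len(lists[0])):
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            seen_in.setdefault(obj, set()).add(i)
        matches = sum(1 for where in seen_in.values() if len(where) == m)
        if matches >= k:
            break

    # Phase 2: random access for every seen object, then rank by combined score.
    combined = {obj: t([scores[obj] for scores in lookup]) for obj in seen_in}
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k], sorted(seen_in)


# Scores partly from the slides, partly hypothetical fillers (marked).
price = [("a", 0.9), ("c", 0.8), ("e", 0.7), ("b", 0.7), ("f", 0.6)]     # f: hypothetical
mileage = [("b", 1.0), ("e", 0.8), ("f", 0.7), ("a", 0.6), ("c", 0.5)]   # c: hypothetical
year = [("a", 0.8), ("c", 0.75), ("e", 0.7), ("b", 0.7), ("f", 0.3)]     # c, f: hypothetical

top_1, candidates = fagin_algorithm([price, mileage, year], avg, k=1)
print(top_1, candidates)
```

With these values the first match is 'e' after three rows, so the candidate set is {a, b, c, e, f}, and the winner is 'b' rather than 'e'. All five candidates must be buffered until phase 2, which is the buffer-size drawback noted for FA.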