Top-k Query Processing Optimal aggregation algorithms for middleware Ronald Fagin, Amnon Lotem, and Moni Naor + Sushruth P. + Arjun Dasgupta.

Slides:

Advertisements

Similar presentations

Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted.

Advertisements

Group Recommendation: Semantics and Efficiency

Web Information Retrieval

 Introduction  Views  Related Work  Preliminaries  Problems Discussed  Algorithm LPTA  View Selection Problem  Experimental Results.

Approximations of points and polygonal chains

Introduction to Algorithms

Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.

Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.

Top-k Query Evaluation with Probabilistic Guarantees By Martin Theobald, Gerald Weikum, Ralf Schenkel.

Automated Ranking of Database Query Results Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis Presented by Mahadevkirthi Mahadevraj Sameer.

Optimized Query Execution in Large Search Engines with Global Page Ordering Xiaohui Long Torsten Suel CIS Department Polytechnic University Brooklyn, NY.

Efficiency of Algorithms

Analysis of Algorithms CS 477/677 Instructor: Monica Nicolescu.

6/15/20151 Top-k algorithms Finding k objects that have the highest overall grades.

Rank Aggregation. Rank Aggregation: Settings Multiple items – Web-pages, cars, apartments,…. Multiple scores for each item – By different reviewers, users,

1 Searching and Integrating Information on the Web Seminar 4: Ranking Queries and Data Privacy Professor Chen Li UC Irvine.

Top-k and Skyline Computation in Database Systems

Aggregation Algorithms and Instance Optimality

Combining Fuzzy Information: an Overview Ronald Fagin Abdullah Mueen -- Slides by Abdullah Mueen.

Preference Analysis Joachim Giesen and Eva Schuberth May 24, 2006.

CSCI 5708: Query Processing I Pusheng Zhang University of Minnesota Feb 3, 2004.

Evaluating Top-k Queries over Web-Accessible Databases Nicolas Bruno Luis Gravano Amélie Marian Columbia University.

1 INF 2914 Information Retrieval and Web Search Lecture 10: Query Processing These slides are adapted from Stanford’s class CS276 / LING 286 Information.

Top- K Query Evaluation with Probabilistic Guarantees Martin Theobald, Gerhard Weikum, Ralf Schenkel Presenter: Avinandan Sengupta.

CS246 Ranked Queries. Junghoo "John" Cho (UCLA Computer Science)2 Traditional Database Query (Dept = “CS”) & (GPA > 3.5) Boolean semantics Clear boundary.

Automated Ranking Of Database Query Results  Sanjay Agarwal - Microsoft Research  Surajit Chaudhuri - Microsoft Research  Gautam Das - Microsoft Research.

Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,

1 Evaluating top-k Queries over Web-Accessible Databases Paper By: Amelie Marian, Nicolas Bruno, Luis Gravano Presented By Bhushan Chaudhari University.

Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.

Ranking in DB Laks V.S. Lakshmanan Depf. of CS UBC.

Supporting Top-k join Queries in Relational Databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by: Richa Varshney.

Information Networks Rank Aggregation Lecture 10.

Iterative Improvement Algorithm 2012/03/20. Outline Local Search Algorithms Hill-Climbing Search Simulated Annealing Search Local Beam Search Genetic.

Efficient Processing of Top-k Spatial Preference Queries

1University of Texas at Arlington.  Introduction  Motivation  Requirements  Paper’s Contribution.  Related Work  Overview of Ripple Join  Rank.

Answering Top-k Queries Using Views By: Gautam Das (Univ. of Texas), Dimitrios Gunopulos (Univ. of California Riverside), Nick Koudas (Univ. of Toronto),

Lecture 1- Query Processing Advanced Databases Masood Niazi Torshiz Islamic Azad university- Mashhad Branch

All right reserved by Xuehua Shen 1 Optimal Aggregation Algorithms for Middleware Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)

Supporting Top-k join Queries in Relational Databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by: Z. Joseph, CSE-UT Arlington.

Combining Fuzzy Information: An Overview Ronald Fagin.

Presented by Suresh Barukula 2011csz  Top-k query processing means finding k- objects, that have highest overall grades.  A query in multimedia.

Answering Top-k Queries Using Views Gautam Das (Univ. of Texas), Dimitrios Gunopulos (Univ. of California Riverside), Nick Koudas (Univ. of Toronto), Dimitris.

Analysis of Algorithms CS 477/677 Instructor: Monica Nicolescu Lecture 7.

Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.

NRA Top k query processing using Non Random Access Only sequential access Only sequential accessAlgorithm 1) 1) scan index lists in parallel; 2) 2) consider.

Real-Time systems By Dr. Amin Danial Asham.

Automated Ranking Of Database Query Results  Sanjay Agarwal - Microsoft Research  Surajit Chaudhuri - Microsoft Research  Gautam Das - Microsoft Research.

Optimal Aggregation Algorithms for Middleware By Ronald Fagin, Amnon Lotem, and Moni Naor.

Query Processing CS 405G Introduction to Database Systems.

Efficient Skyline Computation on Vertically Partitioned Datasets Dimitris Papadias, David Yang, Georgios Trimponias CSE Department, HKUST, Hong Kong.

Fast Indexes and Algorithms For Set Similarity Selection Queries M. Hadjieleftheriou A.Chandel N. Koudas D. Srivastava.

Database Searching and Information Retrieval Presented by: Tushar Kumar.J Ritesh Bagga.

Answering Why-not Questions on Top-K Queries Andy He and Eric Lo The Hong Kong Polytechnic University.

03/02/20061 Evaluating Top-k Queries Over Web-Accessible Databases Amelie Marian Nicolas Bruno Luis Gravano Presented By: Archana and Muhammed.

Chapter 13 Query Optimization Yonsei University 1 st Semester, 2015 Sanghyun Park.

Lecture 3: Uninformed Search

Indexing & querying text

Database Management System

Seung-won Hwang, Kevin Chen-Chuan Chang

Indexing & querying text

Chapter 12: Query Processing

Top-k Query Processing

Rank Aggregation.

Algorithm An algorithm is a finite set of steps required to solve a problem. An algorithm must have following properties: Input: An algorithm must have.

Laks V.S. Lakshmanan Depf. of CS UBC

Popular Ranking Algorithms

Evaluation of Relational Operations: Other Techniques

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

Efficient Processing of Top-k Spatial Preference Queries

Query Specific Ranking

Presentation transcript:

Top-k Query Processing Optimal aggregation algorithms for middleware Ronald Fagin, Amnon Lotem, and Moni Naor + Sushruth P. + Arjun Dasgupta

Why top-k query processing Multimedia brings fuzzy data attribute values are graded typically [0,1] No clear boundary between “answer” / “no answer” A query in a multimedia database means combining graded attributes Combine attributes by aggregation function Aggregation function gives overall grade of object Return k objects with highest overall grade Example:

Top-k query processing = Finding k objects that have the highest overall grades How ?  Which algorithms? Fagin’s Algorithm (FA) Threshold Algorithm (TA) Which is the best algorithm? Keep in mind: Database system serves as middleware Multimedia (objects) may be kept in different subsystems e.g. photoDB, videoDB, search engine Take into account the limitations of these subsystems Top-k query processing

Simple database model Simple query Explaining Fagin’s Algorithm (FA) Finding top-k with FA Explaining Threshold Algortihm (TA) Finding top-k with TA Example

(a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) Sorted L 1 (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) N a b c d Object ID Attribute Attribute M Sorted L 2 Example – Simple Database model

Find the top 2 (k = 2) objects on the following ‘query’ executed on the middleware: A1 & A2 (eg: color=red & shape=round) Example – Simple Query Aggregation function: function that gives objects an overall grade based on attribute grades examples : min, max functions Monotonicity! A1 & A2 as a ‘query’ to the middleware results in the middelware combining the grades of A1 en A2 by min(A1, A2)

c ID A1A1 A2A2 Min(A 1,A 2 ) STEP 1 Read attributes from every sorted list Stop when k objects have been seen in common from all lists (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) L1L1 L2L2 (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) a d b Example – Fagin’s Algorithm

c IDA1A1 A2A2 Min(A 1,A 2 ) STEP 2 Random access to find missing grades (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) L1L1 L2L2 (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) a d b Example – Fagin’s Algortihm

c IDA1A1 A2A2 Min(A 1,A 2 ) STEP 3 Compute the grades of the seen objects. Return the k highest graded objects. (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) L1L1 L2L2 (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) a d b Example – Fagin’s Algortihm

Read all grades of an object once seen from a sorted access No need to wait until the lists give k common objects Do sorted access (and corresponding random accesses) until you have seen the top k answers. How do we know that grades of seen objects are higher than the grades of unseen objects ? Predict maximum possible grade unseen objects: a: 0.9 b: 0.8 c: L1L1 L2L2 d: 0.9 a: 0.85 b: 0.7 c: f: 0.65 d: 0.6 f: 0.6 Seen Possibly unseen Threshold value New Idea !!! Threshold Algorithm (TA) T = min(0.72, 0.7) = 0.7

IDA1A1 A2A2 Min(A 1,A 2 ) Step 1: - parallel sorted access to each list (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) L1L1 L2L2 (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) a d For each object seen: - get all grades by random access - determine Min(A1,A2) - amongst 2 highest seen ? keep in buffer Example – Threshold Algorithm

IDA1A1 A2A2 Min(A 1,A 2 ) a: 0.9 b: 0.8 c: 0.72 d: L1L1 L2L2 d: 0.9 a: 0.85 b: 0.7 c: Step 2: - Determine threshold value based on objects currently seen under sorted access. T = min(L1, L2) a d T = min(0.9, 0.9) = objects with overall grade ≥ threshold value ? stop else go to next entry position in sorted list and repeat step 1 Example – Threshold Algorithm

IDA1A1 A2A2 Min(A 1,A 2 ) Step 1 (Again): - parallel sorted access to each list (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) L1L1 L2L2 (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) a d For each object seen: - get all grades by random access - determine Min(A1,A2) - amongst 2 highest seen ? keep in buffer b Example – Threshold Algorithm

IDA1A1 A2A2 Min(A 1,A 2 ) a: 0.9 b: 0.8 c: 0.72 d: L1L1 L2L2 d: 0.9 a: 0.85 b: 0.7 c: Step 2 (Again): - Determine threshold value based on objects currently seen. T = min(L1, L2) a b T = min(0.8, 0.85) = objects with overall grade ≥ threshold value ? stop else go to next entry position in sorted list and repeat step 1 Example – Threshold Algorithm

IDA1A1 A2A2 Min(A 1,A 2 ) a: 0.9 b: 0.8 c: 0.72 d: L1L1 L2L2 d: 0.9 a: 0.85 b: 0.7 c: Situation at stopping condition a b T = min(0.72, 0.7) = 0.7 Example – Threshold Algorithm

Comparison of Fagin’s and Threshold Algorithm TA sees less objects than FA TA stops at least as early as FA When we have seen k objects in common in FA, their grades are higher or equal than the threshold in TA. TA may perform more random accesses than FA In TA, (m-1) random accesses for each object In FA, Random accesses are done at the end, only for missing grades TA requires only bounded buffer space (k) At the expense of more random seeks FA makes use of unbounded buffers

The best algorithm Which algorithm is the best: TA, FA?? Define “best” middleware cost concept of instance optimality Consider: wild guesses aggregation functions characteristics Monotone, strictly monotone, strict database restrictions distinctness property

middleware cost = cost for processing data subsystems = sc S + rc R A = class of algorithms, A Є A represents an algorithm D = legal inputs to algorithms (databases), D Є D represents a database Cost(A,D ) = middleware cost when running algorithm A over database D The best algorithm: concept of optimality Algorithm B is instance optimal over A and D if : B Є A and Cost(B,D ) = O(Cost(A,D )) A Є A, D Є D Which means that: Cost(B,D ) ≤ c. Cost(A,D ) + c’, A Є A, D Є D optimality ratio A A A

Intuitively: B instance optimal = always the best algorithm in A = always optimal In reality: always is “always”  we will exclude wild guesses algorithms Wild guess = random access on object not previously encounter by sorted access In practice not possible Database need to know ID to do random access If wild guesses allowed in A then no algorithm can be instance optimal Wild guesses can find top-k objects by k·m random accesses (k = #objects, m = #lists) The best algorithm: instance optimality & wild guesses

The best algorithm: aggregation functions Aggregation function t combines object grades into object’s overall grade: x 1,…,x m t(x 1,…,x m ) Monotone : t(x 1,…,x m ) ≤ t(x’ 1,…,x’ m ) if x i ≤ x’ i for every i Strictly monotone: t(x 1,…,x m ) < t(x’ 1,…,x’ m ) if x i < x’ i for every i Strict: t(x 1,…,x m ) = 1 precisely when x i = 1 for every i

The best algorithm: database restrictions Distinctness property: A database has no (sorted) attribute list in which two objects have the same grade

The best algorithm: Fagin’s Algorithm - Database with N objects, each with m attributes. - Orderings of lists are independent FA finds top-k with middleware cost O(N (m-1)/m k 1/m ) FA = optimal with high probability in the worst case for strict monotone aggregation functions

TA = instance optimal (always optimal) for every monotone aggregation function, over every database (excluding wild guesses) = optimal in much stronger sense than Fagin’s Algorithm If strict monotone aggregation function: Optimality ratio = m + m (m-1)c R /c s = best possible (m = # attributes) If random acces not possible (c r = 0 )  optimality ratio = m If sorted access not possible (c s = 0)  optimality ratio = infinite  TA not instance optimal TA = instance optimal (always optimal) for every strictly monotone aggregation function, over every database (including wild guesses) that satisfies the distinctness property Optimality ratio = cm 2 with c = max {c R /c S, c S /c R } The best algorithm: Threshold Algorithm

What if sorted access is restricted ?e.g. use distance database TA z What if random access not possible? e.g. web search engine No Random Access Algorithm What if we want only the approximate top k objects? TA θ What if we consider relative costs of random and sorted access? Combined Algorithm (between TA and NRA) Extending TA

NRA What if we also want the scores? What if we also want the scores?

Combined Algorithm (CA) CA in instance optimal

Approximation -approximation to the top k answers for the aggregation function t is a collection of k objects (each along with its grade) such that for each y among these k objects and each z not among these k objects, t(y)>=t(z) T : As soon as at least k objects have been seen whose grade is at least equal to threshold/ then halt.

?