Top-k Query Processing

Optimal aggregation algorithms for middleware Ronald Fagin, Amnon Lotem, and Moni Naor + Sushruth P. + Arjun Dasgupta

Why top-k query processing
- Multimedia brings fuzzy data: attribute values are graded, typically in [0,1].
- There is no clear boundary between "answer" and "no answer".
- A query in a multimedia database means combining graded attributes.
- Attributes are combined by an aggregation function.
- The aggregation function gives the overall grade of an object.
- Return the k objects with the highest overall grade.
- Example: see the preceding slides on the underlying concepts.

Top-k query processing
= Finding the k objects that have the highest overall grades.
- How? Which algorithms?
  - Fagin's Algorithm (FA)
  - Threshold Algorithm (TA)
- Which is the best algorithm?
Note: because of the short time available for the presentation, we cannot discuss the No Random Access algorithm and the Combined Algorithm.
Keep in mind:
- The database system serves as middleware.
- Multimedia objects may be kept in different subsystems, e.g. photoDB, videoDB, search engine.
- Take into account the limitations of these subsystems.

Example
- Simple database model
- Simple query
- Explaining Fagin's Algorithm (FA)
- Finding top-k with FA
- Explaining Threshold Algorithm (TA)
- Finding top-k with TA

Example – Simple Database model
Object ID   Attribute 1   Attribute 2
a           0.9           0.85
b           0.8           0.7
c           0.72          0.2
d           0.6           0.9
...
(N objects, m attributes)

Sorted L1: (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) ...
Sorted L2: (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) ...

We start by introducing the database model used in the paper. A database has only one relation, so we have one table containing N objects, each with m attributes (in this case m = 2). Each object has a grade for each attribute. The same database can be represented by one sorted list per attribute, ordered by grade. The entries of these lists contain an object ID and a grade.
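The relationship between the table and the sorted lists can be sketched in a few lines of Python. This is only an illustration of the model described above; the names `GRADES` and `sorted_list` are made up for the example and do not come from the paper.

```python
# Grades from the example table: object id -> (Attribute 1, Attribute 2).
# GRADES and sorted_list are illustrative names, not part of the paper.
GRADES = {
    "a": (0.9, 0.85),
    "b": (0.8, 0.7),
    "c": (0.72, 0.2),
    "d": (0.6, 0.9),
}


def sorted_list(grades, attribute):
    """Return (object id, grade) pairs for one attribute, highest grade first."""
    pairs = [(obj, g[attribute]) for obj, g in grades.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


L1 = sorted_list(GRADES, 0)
L2 = sorted_list(GRADES, 1)
print("L1:", L1)
print("L2:", L2)
```

Sorted access then corresponds to reading a list front to back, and random access to looking up one object's grade directly in `GRADES`.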

Example – Simple Query
Find the top 2 (k = 2) objects for the following 'query' executed on the middleware:
A1 & A2 (e.g. color=red & shape=round)
Given A1 & A2 as a 'query', the middleware combines the grades of A1 and A2 by min(A1, A2).
Aggregation function: a function that gives each object an overall grade based on its attribute grades. Examples: min, max.
Monotonicity is the key property: raising an attribute grade never lowers the overall grade.
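As a baseline for the algorithms that follow, the query can be answered naively by computing min(A1, A2) for every object. The sketch below assumes the four example objects and uses illustrative names (`GRADES`, `naive_top_k`); it reads every grade, which is exactly what FA and TA try to avoid.

```python
import heapq

# Example grades: object id -> (A1, A2). Names are illustrative only.
GRADES = {
    "a": (0.9, 0.85),
    "b": (0.8, 0.7),
    "c": (0.72, 0.2),
    "d": (0.6, 0.9),
}


def naive_top_k(grades, k, t=min):
    """Score every object with aggregation function t and keep the k best."""
    scored = ((t(g), obj) for obj, g in grades.items())
    return heapq.nlargest(k, scored)


top2 = naive_top_k(GRADES, 2)
print(top2)
```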

Example – Fagin’s Algorithm
STEP 1
- Read attributes from every sorted list.
- Stop when k objects have been seen in common from all lists.

L1: (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) ...
L2: (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) ...

ID   A1     A2     Min(A1,A2)
a    0.9    0.85
d           0.9
b    0.8    0.7
c    0.72

After three entries per list, a and b have been seen in both lists, so k = 2 common objects are reached.

Example – Fagin's Algorithm
STEP 2
- Random access to find the missing grades (here: A1 of d and A2 of c).

L1: (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) ...
L2: (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) ...

ID   A1     A2     Min(A1,A2)
a    0.9    0.85
d    0.6    0.9
b    0.8    0.7
c    0.72   0.2

Example – Fagin's Algorithm
STEP 3
- Compute the grades of the seen objects.
- Return the k highest graded objects.

L1: (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) ...
L2: (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) ...

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
d    0.6    0.9    0.6
b    0.8    0.7    0.7
c    0.72   0.2    0.2

Result for k = 2: a (0.85) and b (0.7).
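The three steps of FA can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the paper's formal description: all lists are in memory and have equal length, random access is simulated with a dictionary per list, and the function name `fagin` and the access counters are introduced only for this example.

```python
def fagin(lists, k, t=min):
    """Sketch of Fagin's Algorithm over in-memory sorted lists.

    lists: one list of (object id, grade) pairs per attribute, best grade first.
    Returns (top-k as (grade, id) pairs, sorted accesses, random accesses).
    """
    m = len(lists)
    index = [dict(lst) for lst in lists]  # simulated random access
    seen = {}                             # object id -> lists it was seen in
    sorted_accesses = 0

    # Step 1: parallel sorted access until k objects were seen in all lists.
    for depth in range(len(lists[0])):
        for i, lst in enumerate(lists):
            obj, _grade = lst[depth]
            seen.setdefault(obj, set()).add(i)
            sorted_accesses += 1
        if sum(1 for where in seen.values() if len(where) == m) >= k:
            break

    # Step 2: random access for the grades still missing.
    # Step 3: compute overall grades and return the k best.
    random_accesses = 0
    scored = []
    for obj, where in seen.items():
        random_accesses += m - len(where)
        scored.append((t([index[i][obj] for i in range(m)]), obj))
    scored.sort(reverse=True)
    return scored[:k], sorted_accesses, random_accesses


L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
result = fagin([L1, L2], k=2)
print(result)
```

On the example lists this performs three sorted accesses per list and two random accesses, matching the walkthrough.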

New Idea !!! Threshold Algorithm (TA)
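- Read all grades of an object as soon as it is seen under sorted access.
- No need to wait until the lists give k common objects.
- Do sorted access (and the corresponding random accesses) until you have seen the top k answers.
How do we know that the grades of seen objects are higher than the grades of unseen objects?
Predict the maximum possible grade of unseen objects (the threshold value):

[Figure: lists L1 and L2, with the entries read so far marked "Seen" and the remaining entries marked "Possibly unseen".]
Seen in L1: a: 0.9, b: 0.8, c: 0.72
Seen in L2: d: 0.9, a: 0.85, b: 0.7
Threshold value: T = min(0.72, 0.7) = 0.7

Any object not yet seen has a grade of at most 0.72 in L1 and at most 0.7 in L2, so its overall grade is at most T. The figure also shows a possibly unseen object f with grades 0.6 and 0.65, which stays below T.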
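The threshold is simply the aggregation function applied to the grades last read from each list. A small sketch, assuming the example lists and an illustrative helper named `threshold`:

```python
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]


def threshold(lists, depth, t=min):
    """Upper bound on the overall grade of any object not yet seen.

    depth is 1-based: the number of entries read so far from each list.
    """
    return t([lst[depth - 1][1] for lst in lists])


thresholds = [threshold([L1, L2], depth) for depth in range(1, 5)]
print(thresholds)
```

The bound only holds because the aggregation function is monotone: an unseen object cannot beat the last-read grade in any list.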

Example – Threshold Algorithm
Step 1:
- Parallel sorted access to each list.
- For each object seen:
  - get all grades by random access
  - determine Min(A1,A2)
  - amongst the 2 highest seen? Keep in buffer.

L1: (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) ...
L2: (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) ...

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
d    0.6    0.9    0.6

Step 2:
- Determine the threshold value from the grades last seen under sorted access: T = min(last grade in L1, last grade in L2).
- 2 objects with overall grade ≥ threshold value? Stop. Otherwise go to the next entry position in the sorted lists and repeat step 1.

L1: a: 0.9, b: 0.8, c: 0.72, d: 0.6 ...
L2: d: 0.9, a: 0.85, b: 0.7, c: 0.2 ...

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
d    0.6    0.9    0.6

T = min(0.9, 0.9) = 0.9
No buffered object reaches 0.9, so the algorithm continues.

Step 1 (Again):
- Parallel sorted access to each list (sorted access = sequential access).
- For each object seen:
  - get all grades by random access
  - determine Min(A1,A2)
  - amongst the 2 highest seen? Keep in buffer.

L1: (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) ...
L2: (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) ...

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
d    0.6    0.9    0.6
b    0.8    0.7    0.7

Step 2 (Again):
- Determine the threshold value from the grades last seen under sorted access: T = min(last grade in L1, last grade in L2).
- 2 objects with overall grade ≥ threshold value? Stop. Otherwise go to the next entry position in the sorted lists and repeat step 1.

L1: a: 0.9, b: 0.8, c: 0.72, d: 0.6 ...
L2: d: 0.9, a: 0.85, b: 0.7, c: 0.2 ...

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
b    0.8    0.7    0.7

T = min(0.8, 0.85) = 0.8
Only a reaches 0.8, so the algorithm continues.

Example – Threshold Algorithm: situation at stopping condition

L1: a: 0.9, b: 0.8, c: 0.72, d: 0.6 ...
L2: d: 0.9, a: 0.85, b: 0.7, c: 0.2 ...

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
b    0.8    0.7    0.7

T = min(0.72, 0.7) = 0.7
Both buffered objects have an overall grade ≥ 0.7, so TA stops and returns a and b.
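The TA walkthrough above can be sketched as a short loop. This is an illustrative sketch under simplifying assumptions (in-memory lists of equal length, random access simulated with dictionaries, a buffer of size k kept as a min-heap); the name `threshold_algorithm` and the access counters are not from the paper.

```python
import heapq


def threshold_algorithm(lists, k, t=min):
    """Sketch of TA. Returns (top-k, sorted accesses, random accesses)."""
    m = len(lists)
    index = [dict(lst) for lst in lists]  # simulated random access
    top = []                              # min-heap of (grade, id), size <= k
    sorted_accesses = random_accesses = 0

    for depth in range(len(lists[0])):
        last_seen = []
        for lst in lists:
            obj, grade = lst[depth]       # sorted access
            sorted_accesses += 1
            last_seen.append(grade)
            # Random access to the other m - 1 lists for this object.
            score = t([index[j][obj] for j in range(m)])
            random_accesses += m - 1
            if (score, obj) in top:
                continue                  # already buffered
            if len(top) < k:
                heapq.heappush(top, (score, obj))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, obj))
        # Halt once k buffered objects reach the threshold.
        tau = t(last_seen)
        if len(top) == k and top[0][0] >= tau:
            break

    return sorted(top, reverse=True), sorted_accesses, random_accesses


L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
result = threshold_algorithm([L1, L2], k=2)
print(result)
```

On the example lists the loop stops in the third round, when the threshold has dropped to 0.7 and both a and b are in the buffer.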

Comparison of Fagin’s and Threshold Algorithm
- TA never sees more objects than FA.
  - TA stops at least as early as FA.
  - When FA has seen k objects in common, their grades are higher than or equal to the threshold in TA: each of these objects sits at or above the current position in every list, so by monotonicity its overall grade is at least the aggregate of the last-seen grades.
- TA may perform more random accesses than FA.
  - In TA: (m-1) random accesses for each object seen under sorted access.
  - In FA: random accesses are done at the end, only for missing grades.
- TA requires only bounded buffer space (k objects), at the expense of more random seeks.
- FA makes use of unbounded buffers.

The best algorithm
Which algorithm is the best: TA or FA?
- Define "best":
  - middleware cost
  - concept of instance optimality
- Consider:
  - wild guesses
  - aggregation function characteristics: monotone, strictly monotone, strict
  - database restrictions: distinctness property

The best algorithm: concept of optimality
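- $\mathbf{A}$ = class of algorithms; $A \in \mathbf{A}$ represents an algorithm.
- $\mathbf{D}$ = class of legal inputs to the algorithms (databases); $D \in \mathbf{D}$ represents a database.
- Middleware cost = cost of accessing the data subsystems = $s \cdot c_S + r \cdot c_R$, where $s$ is the number of sorted accesses, $r$ the number of random accesses, and $c_S$, $c_R$ are the costs of a single sorted and random access.
- $\mathrm{Cost}(A, D)$ = middleware cost when running algorithm $A$ over database $D$.
- Algorithm $B$ is instance optimal over $\mathbf{A}$ and $\mathbf{D}$ if $B \in \mathbf{A}$ and $\mathrm{Cost}(B, D) = O(\mathrm{Cost}(A, D))$ for every $A \in \mathbf{A}$ and every $D \in \mathbf{D}$.
- Which means that $\mathrm{Cost}(B, D) \le c \cdot \mathrm{Cost}(A, D) + c'$ for every $A \in \mathbf{A}$ and every $D \in \mathbf{D}$, for constants $c$ and $c'$.
In other words, the middleware cost of running algorithm B is at most a constant factor (plus an additive constant) times the middleware cost of any other algorithm A in the class, on every database. The constant $c$ is called the optimality ratio.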
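To make the cost measure concrete, the sketch below evaluates the middleware cost for the access counts of the worked example (k = 2, m = 2): FA used 6 sorted and 2 random accesses, TA used 6 sorted and 6 random accesses. The unit costs tried here are arbitrary values chosen for illustration, and `middleware_cost` is a hypothetical helper name.

```python
def middleware_cost(sorted_accesses, random_accesses, c_s, c_r):
    """Middleware cost s * c_S + r * c_R."""
    return sorted_accesses * c_s + random_accesses * c_r


# Access counts read off the worked example (k = 2, m = 2).
FA_COUNTS = (6, 2)  # (sorted accesses, random accesses)
TA_COUNTS = (6, 6)

# Arbitrary unit costs (c_S, c_R), for illustration only.
for c_s, c_r in [(1, 1), (1, 10), (10, 1)]:
    fa = middleware_cost(*FA_COUNTS, c_s, c_r)
    ta = middleware_cost(*TA_COUNTS, c_s, c_r)
    print(f"c_S={c_s}, c_R={c_r}: Cost(FA)={fa}, Cost(TA)={ta}")
```

A single small database says nothing about instance optimality, which is a statement about every database in the class; the example only shows how the two access types enter the cost.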

The best algorithm: instance optimality & wild guesses
- Intuitively: B instance optimal = always the best algorithm in $\mathbf{A}$ = always optimal.
- In reality, "always" needs a qualification: we exclude algorithms that make wild guesses.
- Wild guess = random access on an object not previously encountered by sorted access.
  - In practice not possible: the database needs to know the object ID to do a random access.
- If wild guesses are allowed in $\mathbf{A}$, then no algorithm can be instance optimal.
  - Wild guesses can find the top-k objects by k·m random accesses (k = number of objects, m = number of lists).
With wild guesses it is possible to determine the top k objects with only k·m random accesses. Since an algorithm without wild guesses can need far more accesses on the same database, the optimality ratio can be made arbitrarily large.
Open point, still to be checked: later it is proved that TA is instance optimal over all databases that satisfy the distinctness property, even against algorithms that make wild guesses. Why is it possible there after all?

The best algorithm: aggregation functions
Aggregation function t combines an object's grades into the object's overall grade: $(x_1, \dots, x_m) \mapsto t(x_1, \dots, x_m)$
- Monotone: $t(x_1, \dots, x_m) \le t(x'_1, \dots, x'_m)$ if $x_i \le x'_i$ for every $i$
- Strictly monotone: $t(x_1, \dots, x_m) < t(x'_1, \dots, x'_m)$ if $x_i < x'_i$ for every $i$
- Strict: $t(x_1, \dots, x_m) = 1$ precisely when $x_i = 1$ for every $i$
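The three properties can be spot-checked numerically on a small grid of grade vectors. The sketch below is only a sanity check on sample points, not a proof, and the helper names are made up for this example.

```python
from itertools import product


def is_monotone_on(t, points):
    """t(x) <= t(y) whenever x_i <= y_i for every i (on the sample points)."""
    return all(t(x) <= t(y) for x in points for y in points
               if all(a <= b for a, b in zip(x, y)))


def is_strictly_monotone_on(t, points):
    """t(x) < t(y) whenever x_i < y_i for every i (on the sample points)."""
    return all(t(x) < t(y) for x in points for y in points
               if all(a < b for a, b in zip(x, y)))


def is_strict_on(t, points):
    """t(x) == 1 precisely when x_i == 1 for every i (on the sample points)."""
    return all((t(x) == 1) == all(a == 1 for a in x) for x in points)


# Grade vectors with m = 2 on the grid {0, 0.25, 0.5, 0.75, 1}.
GRID = [i / 4 for i in range(5)]
POINTS = list(product(GRID, repeat=2))

for name, t in [("min", min), ("max", max), ("constant", lambda x: 0.5)]:
    print(name, is_monotone_on(t, POINTS),
          is_strictly_monotone_on(t, POINTS), is_strict_on(t, POINTS))
```

On this grid min passes all three checks, max fails only strictness (max(1, 0) = 1), and a constant function is monotone but neither strictly monotone nor strict.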

The best algorithm: database restrictions
Distinctness property: a database has no (sorted) attribute list in which two objects have the same grade.
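Checking the distinctness property amounts to checking each list for duplicate grades. A minimal sketch, with the illustrative helper name `satisfies_distinctness`:

```python
def satisfies_distinctness(lists):
    """True if no list contains two objects with the same grade."""
    return all(len({grade for _obj, grade in lst}) == len(lst)
               for lst in lists)


L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
TIED = [("a", 0.9), ("b", 0.9), ("c", 0.5)]  # a and b share a grade

print(satisfies_distinctness([L1, L2]))
print(satisfies_distinctness([L1, TIED]))
```

Note that the property is per list: the same grade may appear in two different lists, as 0.9 does in L1 and L2 of the example.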

The best algorithm: Fagin’s Algorithm
- Database with N objects, each with m attributes.
- Orderings of the lists are independent.
- FA finds the top k with middleware cost $O(N^{(m-1)/m} k^{1/m})$.
- FA is optimal with high probability in the worst case for strict, monotone aggregation functions.
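To get a feeling for how this bound grows, the expression inside the O(...) can be evaluated for a few parameter choices. The constant factor hidden by the O-notation is ignored, so the numbers are orders of magnitude rather than access counts; `fa_cost_bound` and the sample values of N, m and k are assumptions made for this illustration.

```python
import math


def fa_cost_bound(n, m, k):
    """Value of N^((m-1)/m) * k^(1/m), without the hidden constant."""
    return n ** ((m - 1) / m) * k ** (1 / m)


# Illustrative parameter choices, not taken from the paper.
for n, m, k in [(1_000_000, 2, 10), (1_000_000, 3, 10), (1_000_000, 2, 100)]:
    print(f"N={n}, m={m}, k={k}: about {fa_cost_bound(n, m, k):,.0f}")
```

For m = 2 the bound is the square root of N·k, so it is sublinear in N; as m grows, the exponent (m-1)/m approaches 1 and the bound approaches a full scan.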

The best algorithm: Threshold Algorithm
- TA is instance optimal (always optimal) for every monotone aggregation function, over every database (excluding wild guesses).
  - This is optimality in a much stronger sense than for Fagin's Algorithm.
  - If the aggregation function is strict and monotone: optimality ratio = $m + m(m-1)\,c_R/c_S$, which is the best possible (m = number of attributes).
  - If random access is free ($c_R = 0$): optimality ratio = m.
  - If sorted access is free ($c_S = 0$): the ratio is unbounded, so TA is not instance optimal.
- TA is instance optimal (always optimal) for every strictly monotone aggregation function, over every database that satisfies the distinctness property (including wild guesses).
  - Optimality ratio = $c \cdot m^2$ with $c = \max\{c_R/c_S,\ c_S/c_R\}$.
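The two optimality ratios quoted above are easy to evaluate for concrete costs. The following sketch simply codes the two formulas; the function names and the sample values of m, c_R and c_S are assumptions chosen for illustration.

```python
def ta_ratio_no_wild_guesses(m, c_r, c_s):
    """m + m(m-1) * c_R / c_S (strict, monotone t; wild guesses excluded)."""
    return m + m * (m - 1) * c_r / c_s


def ta_ratio_distinctness(m, c_r, c_s):
    """c * m^2 with c = max(c_R/c_S, c_S/c_R) (distinctness property)."""
    c = max(c_r / c_s, c_s / c_r)
    return c * m * m


# Illustrative values: (m, c_R, c_S).
for m, c_r, c_s in [(2, 1, 1), (3, 10, 1), (3, 1, 10)]:
    print(m, c_r, c_s,
          ta_ratio_no_wild_guesses(m, c_r, c_s),
          ta_ratio_distinctness(m, c_r, c_s))
```

The first ratio grows with the relative price of random access, which is the motivation for variants that use fewer random accesses.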

Extending TA
- What if sorted access is restricted? e.g. use of a distance database: TAz
- What if random access is not possible? e.g. web search engine: No Random Access Algorithm (NRA)
- What if we want only the approximate top k objects? TAθ
- What if we consider the relative costs of random and sorted access? Combined Algorithm (CA, between TA and NRA)

NRA What if we also want the scores?

Combined Algorithm (CA)
CA is instance optimal

Approximation
- For θ > 1, a θ-approximation to the top k answers for the aggregation function t is a collection of k objects (each along with its grade) such that for each y among these k objects and each z not among these k objects, $\theta \cdot t(y) \ge t(z)$.
- TAθ: as soon as at least k objects have been seen whose grade is at least equal to threshold/θ, halt.
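The relaxed halting rule can be sketched by changing one comparison in a TA-style loop. The code below is an illustrative sketch under simplifying assumptions (in-memory lists of equal length, random access simulated with dictionaries); the name `ta_theta` is made up, and the test `theta * grade >= threshold` is used instead of dividing the threshold by θ, which is equivalent for θ > 0.

```python
import heapq


def ta_theta(lists, k, theta, t=min):
    """Sketch of TA with the approximate halting rule.

    Returns (k objects as (grade, id) pairs, number of sorted accesses).
    With theta = 1 this behaves like the exact threshold rule.
    """
    m = len(lists)
    index = [dict(lst) for lst in lists]  # simulated random access
    top = []                              # min-heap of (grade, id), size <= k
    sorted_accesses = 0

    for depth in range(len(lists[0])):
        last_seen = []
        for lst in lists:
            obj, grade = lst[depth]
            sorted_accesses += 1
            last_seen.append(grade)
            score = t([index[j][obj] for j in range(m)])
            if (score, obj) in top:
                continue
            if len(top) < k:
                heapq.heappush(top, (score, obj))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, obj))
        # Halt when k objects have grade >= threshold / theta.
        if len(top) == k and theta * top[0][0] >= t(last_seen):
            break

    return sorted(top, reverse=True), sorted_accesses


L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
exact = ta_theta([L1, L2], k=2, theta=1)
approx = ta_theta([L1, L2], k=2, theta=2)
print(exact)
print(approx)
```

With θ = 2 the sketch halts after the first round and returns a and d instead of a and b. That is a valid 2-approximation, since 2 · 0.6 ≥ 0.7, obtained with fewer accesses.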
