A Thin Monitoring Layer for Top-k Aggregation Queries over a Database
  Foteini Alvanaki, Sebastian Michel
  Saarland University
  DBRank 2013, Riva del Garda, Italy
Data Cube sum(price*quantity)
Data Cube
  Brand   Product Type  Country   sum(Price*Quantity)
  Brand1  Type1         Country1  1234
  Brand1  Type2         Country1  3522
  Brand1  Type1         -         1234
  Brand1  Type2         -         3522
  Brand1  -             Country1  4756
  -       Type1         Country1  1234
  -       Type2         Country1  3522
  Brand1  -             -         4756
  -       Type1         -         1234
  -       Type2         -         3522
  -       -             Country
  1. What are the top-2 product types with the highest revenue of each brand in each country?
  2. What are the top-2 brands with the highest revenue in each country?
Top-k Queries
  Primary Attribute: the attribute/dimension over which the selection is performed (e.g. product type)
  Secondary Attributes: used to filter specific results (e.g. brand, country)
  Aggregated Attributes: used to compute an aggregated score (e.g. price, quantity)
  Aggregate Function: e.g. sum
  One top-k query for each combination of secondary attribute instances (filtering condition)
Filtering Conditions: Example (1)
  brand={X} - country={Y, W}
  Conditions: brand=X AND country=Y; brand=X AND country=W
  SELECT type, SUM(price*quantity)
  FROM relation
  WHERE brand=X AND country=Y
  GROUP BY type
  ORDER BY SUM(price*quantity) DESC
  LIMIT K
  SELECT type, SUM(price*quantity)
  FROM relation
  WHERE brand=X AND country=W
  GROUP BY type
  ORDER BY SUM(price*quantity) DESC
  LIMIT K
Filtering Conditions: Example (2)
  country={Y, W} - brand={X}
  Conditions: country=Y; country=W; brand=X
  SELECT type, SUM(price*quantity)
  FROM relation
  WHERE country=Y
  GROUP BY type
  ORDER BY SUM(price*quantity) DESC
  LIMIT K
  SELECT type, SUM(price*quantity)
  FROM relation
  WHERE country=W
  GROUP BY type
  ORDER BY SUM(price*quantity) DESC
  LIMIT K
  SELECT type, SUM(price*quantity)
  FROM relation
  WHERE brand=X
  GROUP BY type
  ORDER BY SUM(price*quantity) DESC
  LIMIT K
Filtering Conditions: Example (3)
  country={Y, W} - brand={X}
  No filtering condition:
  SELECT type, SUM(price*quantity)
  FROM relation
  GROUP BY type
  ORDER BY SUM(price*quantity) DESC
  LIMIT K
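The three examples above together cover every filtering condition for brand={X} and country={Y, W}: each secondary attribute is either fixed to one instance or left unconstrained. A minimal sketch of that enumeration follows; the names `ANY`, `filtering_conditions` and `topk_query` are hypothetical helpers for illustration, not part of the presented system.

```python
from itertools import product

ANY = None  # hypothetical marker: no filter on this attribute


def filtering_conditions(instances):
    """Yield one condition (dict) per combination of attribute instances.

    Each secondary attribute is either fixed to one of its instances
    or left unconstrained (ANY).
    """
    names = list(instances)
    choices = [list(instances[name]) + [ANY] for name in names]
    for combo in product(*choices):
        yield {n: v for n, v in zip(names, combo) if v is not ANY}


def topk_query(condition, k):
    """Build the top-k query text and its parameters for one condition.

    Attribute names are assumed to come from a trusted schema; values
    are passed as parameters instead of being spliced into the SQL.
    """
    where = " AND ".join(f"{attr} = ?" for attr in condition)
    where_clause = f" WHERE {where}" if where else ""
    sql = (
        "SELECT type, SUM(price*quantity) AS score FROM relation"
        f"{where_clause} GROUP BY type ORDER BY score DESC LIMIT {int(k)}"
    )
    return sql, list(condition.values())


conditions = list(filtering_conditions({"brand": ["X"], "country": ["Y", "W"]}))
queries = [topk_query(c, 2) for c in conditions]
```

For one brand and two countries this yields (1+1) * (2+1) = 6 conditions: two from Example (1), three from Example (2) and the unfiltered one from Example (3).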
Updates
  Insertions to the underlying database that contain all information related to the top-k queries
  INSERT INTO relation (type, brand, country, price, quantity)
  VALUES (T, X, Y, 100, 3)
Problem
  How to maintain all these queries in the presence of fast updates?
Outline
  Setting/Problem
  Algorithms
  – Naïve Approach
  – Estimates Approach
  – Groups Approach
  Experimental Results
  Conclusions
Example
  SELECT type, SUM(price*quantity)
  FROM relation
  WHERE brand=X AND country=Y
  GROUP BY type
  ORDER BY SUM(price*quantity) DESC
  LIMIT 2
  Update: (type, X, Y, 300)
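To make the example query and update concrete, here is a small sketch that runs them against an in-memory SQLite database. The table contents are invented for illustration, and the update value 300 is modelled as price 100 times quantity 3, as in the INSERT shown on the Updates slide.

```python
import sqlite3

# Top-2 query for the ranking brand=X, country=Y (parameterised).
TOP2 = """
SELECT type, SUM(price*quantity) AS score
FROM relation
WHERE brand = ? AND country = ?
GROUP BY type
ORDER BY score DESC
LIMIT 2
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relation "
    "(type TEXT, brand TEXT, country TEXT, price INTEGER, quantity INTEGER)"
)
# Illustrative rows only; they are not data from the talk.
conn.executemany(
    "INSERT INTO relation VALUES (?, ?, ?, ?, ?)",
    [
        ("A", "X", "Y", 3452, 1),
        ("B", "X", "Y", 2406, 1),
        ("C", "X", "Y", 2356, 1),
        ("C", "X", "W", 9999, 1),  # different country, ignored by this ranking
    ],
)

before = conn.execute(TOP2, ("X", "Y")).fetchall()
# Update (C, X, Y, 300): C rises from 2356 to 2656 and overtakes B.
conn.execute("INSERT INTO relation VALUES (?, ?, ?, ?, ?)", ("C", "X", "Y", 100, 3))
after = conn.execute(TOP2, ("X", "Y")).fetchall()
```

A single insertion is enough to change the top-2, which is why every affected ranking has to be re-checked after each update.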
Naïve Approach
  Case 1: type in the top-2, e.g. (B,X,Y,300)
  top-2 before: A 3452, B 2406
  top-2 after:  A 3452, B 2706
  Case 2: type NOT in the top-2, e.g. (K,X,Y,300)
  Verification Query:
  SELECT type, SUM(price*quantity)
  FROM relation
  WHERE brand=X AND country=Y AND type=K
  GROUP BY type
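The two cases of the Naïve Approach can be sketched as follows for a single ranking (brand=X, country=Y). The function name `naive_update`, the table contents and the in-memory `topk` dictionary are assumptions made for this illustration; updates are assumed to match the ranking's filtering condition and to carry positive values only.

```python
import sqlite3


def naive_update(conn, topk, update):
    """Apply one insertion; return True if a verification query was issued."""
    type_, brand, country, price, quantity = update
    conn.execute("INSERT INTO relation VALUES (?, ?, ?, ?, ?)", update)
    if type_ in topk:
        # Case 1: the exact score is in memory, so add the inserted value.
        topk[type_] += price * quantity
        return False
    # Case 2: fetch the exact score of this one type from the database.
    _, score = conn.execute(
        "SELECT type, SUM(price*quantity) FROM relation "
        "WHERE brand = ? AND country = ? AND type = ? GROUP BY type",
        (brand, country, type_),
    ).fetchone()
    weakest = min(topk, key=topk.get)
    if score > topk[weakest]:
        # The updated type displaces the current k-th entry.
        del topk[weakest]
        topk[type_] = score
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relation "
    "(type TEXT, brand TEXT, country TEXT, price INTEGER, quantity INTEGER)"
)
# Illustrative starting data: A and B form the top-2, K is outside it.
conn.executemany(
    "INSERT INTO relation VALUES (?, ?, ?, ?, ?)",
    [("A", "X", "Y", 3452, 1), ("B", "X", "Y", 2406, 1), ("K", "X", "Y", 2000, 1)],
)
topk = {"A": 3452, "B": 2406}
```

Every update to a type outside the top-k costs one database round trip in this scheme, which is the overhead the Estimates and Groups approaches aim to reduce.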
Estimates Approach
  In-memory Structures
  top-(k+N) instances with exact aggregated scores
  B instances with estimated aggregated scores: best possible score (basic score) + inserted values
  top-5 (first two entries are the top-2): A 3452, B 2406, C 2356, D 2167, E 1987
  Buffer: O 1990, P 2112, Q 2076, R 1997
Estimates Approach
  Case 1.1: type in the top-2, e.g. (B,X,Y,300)
  top-5 before (first two entries are the top-2): A 3452, B 2406, C 2356, D 2167, E 1987
  top-5 after:                                    A 3452, B 2706, C 2356, D 2167, E 1987
Estimates Approach
  Case 1.2: type in the top-5, e.g. (D,X,Y,300)
  top-5 before (first two entries are the top-2): A 3452, B 2406, C 2356, D 2167, E 1987
  top-5 after +300:                               A 3452, B 2406, C 2356, D 2467, E 1987
  top-5 re-sorted (D enters the top-2):           A 3452, D 2467, B 2406, C 2356, E 1987
Estimates Approach
  Case 2: type in the Buffer, e.g. (P,X,Y,300)
  top-5 (first two entries are the top-2): A 3452, B 2406, C 2356, D 2167, E 1987
  Buffer before:     O 1990, P 2112, Q 2076, R 1997
  Buffer after +300: O 1990, P 2412, Q 2076, R 1997
  Verification Query:
  SELECT type, SUM(price*quantity)
  FROM relation
  WHERE brand=X AND country=Y AND type=P
  GROUP BY type
Estimates Approach
  Sub-case 2.1: score(P) < score(E), e.g. score(P) = 756
  top-5 (first two entries are the top-2): A 3452, B 2406, C 2356, D 2167, E 1987
  Buffer with verified score(P): O 1990, P 756, Q 2076, R 1997
  Buffer afterwards:             O 1990, Q 2076, R 1997
Estimates Approach
  Sub-case 2.2: score(P) > score(E), e.g. score(P) = 2178
  top-5 before (first two entries are the top-2): A 3452, B 2406, C 2356, D 2167, E 1987
  Buffer with verified score(P): O 1990, P 2178, Q 2076, R 1997
  top-5 afterwards (P replaces E): A 3452, B 2406, C 2356, P 2178, D 2167
  Buffer afterwards:               O 1990, Q 2076, R 1997
Estimates Approach
  Sub-case 2.3: score(P) > score(B), e.g. score(P) = 2407
  top-5 before (first two entries are the top-2): A 3452, B 2406, C 2356, D 2167, E 1987
  Buffer with verified score(P): O 1990, P 2407, Q 2076, R 1997
  top-5 afterwards (P enters the top-2): A 3452, P 2407, B 2406, C 2356, D 2167
  Buffer afterwards:                     O 1990, Q 2076, R 1997
Estimates Approach
  Case 3: type NOT in in-memory structures, e.g. (T,X,Y,300)
  Estimated score(T) = basic score + inserted value = 1987 + 300 = 2287
  Buffer full: Reset Query
  SELECT type, SUM(price*quantity)
  FROM relation
  WHERE brand=X AND country=Y AND type IN (O,P,Q,R)
  GROUP BY type
Estimates Approach
  Case 3: type NOT in in-memory structures, e.g. (T,X,Y,300)
  Reset Query result: score(O)=1254, score(P)=432, score(Q)=2050, score(R)=1990
  top-5 before (first two entries are the top-2): A 3452, B 2406, C 2356, D 2167, E 1987
  Buffer before: O 1990, P 2112, Q 2076, R 1997
  top-5 afterwards (Q replaces E): A 3452, B 2406, C 2356, D 2167, Q 2050
  Buffer afterwards: T 2287
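Cases 1 to 3 of the Estimates Approach can be put together in one small sketch. The class name `EstimatesRanking`, the `exact_score` callable (a stand-in for the verification and reset queries) and the query counter are hypothetical. The trigger used here, verifying a buffered type once its estimate exceeds the current k-th exact score, is an assumption: the slides show the verification but do not state the exact condition.

```python
class EstimatesRanking:
    """In-memory state for one ranking: exact top-(k+N) plus an estimate buffer."""

    def __init__(self, k, top, buffer, capacity, exact_score):
        self.k = k
        self.size = len(top)            # k + N entries with exact scores
        self.top = dict(top)
        self.buffer = dict(buffer)      # upper-bound estimates
        self.capacity = capacity        # buffer size B
        self.exact_score = exact_score  # callable: list of types -> {type: score}
        self.queries = 0                # database round trips issued so far

    def ranked(self):
        return sorted(self.top.items(), key=lambda kv: -kv[1])

    def topk(self):
        return [t for t, _ in self.ranked()[: self.k]]

    def _admit(self, exact):
        """Merge exact scores into the top-(k+N) and keep only the best entries."""
        self.top.update(exact)
        self.top = dict(self.ranked()[: self.size])

    def _verify_if_needed(self, t):
        kth_score = self.ranked()[self.k - 1][1]
        if self.buffer[t] > kth_score:  # assumed trigger, see lead-in
            self.queries += 1           # verification query
            score = self.exact_score([t])[t]
            del self.buffer[t]
            self._admit({t: score})     # sub-cases 2.1 to 2.3

    def update(self, t, value):
        if t in self.top:               # Case 1: exact score is in memory
            self.top[t] += value
            return
        if t in self.buffer:            # Case 2: refine the estimate
            self.buffer[t] += value
        else:                           # Case 3: unknown type
            # Basic score: the lowest exact score kept in memory.
            estimate = self.ranked()[-1][1] + value
            if len(self.buffer) >= self.capacity:
                self.queries += 1       # reset query over the whole buffer
                self._admit(self.exact_score(list(self.buffer)))
                self.buffer.clear()
            self.buffer[t] = estimate
        self._verify_if_needed(t)
```

With the numbers from the slides, an update (T,X,Y,300) on a full buffer issues one reset query, moves Q into the top-5 in place of E and leaves T in the buffer with the estimate 2287.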
Queries Characteristics
  SAME primary attribute
  SAME aggregate attributes
  SAME aggregate function
  SAME top-k condition
  DIFFERENT filtering condition
Lattice organisation
Groups Approach
  The updates are forwarded from top to bottom in the lattice
  Each ranking forwards the queried results to the rankings lying in lower levels in the lattice
Groups Approach: Example
  Ranking: brand=X, country=ANY
  SELECT type, SUM(price*quantity)
  FROM relation
  WHERE brand=X
  GROUP BY type
  ORDER BY SUM(price*quantity) DESC
  LIMIT 2
  Update: (type, X, Y, 300)
Groups Approach
  Case 2: type in the Buffer, e.g. (P,X,Y,300)
  top-5 (first two entries are the top-2): A 3452, B 2406, C 2356, D 2167, E 1987
  Buffer before:     O 1990, P 2112, Q 2076, R 1997
  Buffer after +300: O 1990, P 2412, Q 2076, R 1997
  Verification Query
Groups Approach
  Case 2: type in the Buffer, e.g. (P,X,Y,300)
  Verification Query:
  SELECT type, brand, country, price*quantity
  FROM relation
  WHERE brand=X AND type=P
  Case 4: type NOT in in-memory structures, e.g. (T,X,Y,300)
  Buffer Reset Query:
  SELECT type, brand, country, price*quantity
  FROM relation
  WHERE brand=X AND type IN (O,P,Q,R)
Groups Approach
  The query returns tuples (type, brand, country, price*quantity), limited to those satisfying the ranking's filtering condition
  The ranking uses them to compute the scores
  It forwards them to the rankings lower in the lattice
  Rankings receiving tuples use those that qualify for their own filtering condition to compute their scores
Groups Approach: Verification Query
  SELECT brand, country, price*quantity
  FROM relation
  WHERE brand=X AND type=P
  Returns a set of (brand, country, price*quantity) tuples, limited to those that have brand=X
  The ranking uses them to compute score(P)
  It forwards them to the rankings lower in the lattice ({brand=X, country=Y}, {brand=X, country=W})
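The forwarding step can be sketched as a small tree of rankings, where one verification query at {brand=X} also serves the two rankings below it. The class `RankingNode`, the dictionary row format and the tuple values are hypothetical and chosen only so that score(P) adds up to 2178 at the top ranking.

```python
from collections import defaultdict


class RankingNode:
    """One ranking in the lattice, identified by its filtering condition."""

    def __init__(self, condition, children=()):
        self.condition = condition      # e.g. {"brand": "X", "country": "Y"}
        self.children = list(children)  # rankings lower in the lattice
        self.scores = {}                # exact scores computed from tuples

    def qualifies(self, row):
        return all(row[attr] == val for attr, val in self.condition.items())

    def receive(self, rows):
        """Score the qualifying tuples, then forward them down the lattice."""
        mine = [row for row in rows if self.qualifies(row)]
        totals = defaultdict(int)
        for row in mine:
            totals[row["type"]] += row["value"]  # value = price*quantity
        self.scores.update(totals)
        for child in self.children:
            child.receive(mine)


# Illustrative result of the verification query for brand=X AND type=P.
rows = [
    {"type": "P", "brand": "X", "country": "Y", "value": 1000},
    {"type": "P", "brand": "X", "country": "W", "value": 778},
    {"type": "P", "brand": "X", "country": "Y", "value": 400},
]

brand_x_country_y = RankingNode({"brand": "X", "country": "Y"})
brand_x_country_w = RankingNode({"brand": "X", "country": "W"})
brand_x = RankingNode({"brand": "X"}, [brand_x_country_y, brand_x_country_w])
brand_x.receive(rows)
```

Because the query returns raw tuples rather than an aggregated sum, the lower rankings can derive their own score(P) without issuing queries of their own.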
Groups Approach
  Case 4: type NOT in in-memory structures, e.g. (T,X,Y,300)
  Estimated score(T) = score(E) + inserted value = 1987 + 300 = 2287
  Buffer full: Reset Query
  SELECT type, brand, country, price*quantity
  FROM relation
  WHERE brand=X AND type IN (O,P,Q,R)
Experiments (1)
  TPC-H data
  Select on part.p_partKey (200,000 unique values)
  Filter on customer.c_mktsegment, orders.o_orderpriority and region.r_name
  Aggregation sum on lineitem.l_quantity
  216 total rankings
  30,000 updates/insertions
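One reading that is consistent with the figure of 216 rankings: each of the three filter attributes takes five distinct values in TPC-H, and each attribute is either fixed to one value or left unconstrained, giving (5+1)^3 combinations. The check below is a sketch under that assumption; the `ANY` marker is hypothetical.

```python
from itertools import product

ANY = "ANY"  # hypothetical marker: attribute left unconstrained

# The five values each attribute takes in the TPC-H schema.
MKTSEGMENTS = ["AUTOMOBILE", "BUILDING", "FURNITURE", "HOUSEHOLD", "MACHINERY"]
ORDERPRIORITIES = ["1-URGENT", "2-HIGH", "3-MEDIUM", "4-NOT SPECIFIED", "5-LOW"]
REGIONS = ["AFRICA", "AMERICA", "ASIA", "EUROPE", "MIDDLE EAST"]

# One ranking per combination of (segment or ANY, priority or ANY, region or ANY).
rankings = list(
    product(MKTSEGMENTS + [ANY], ORDERPRIORITIES + [ANY], REGIONS + [ANY])
)
total_rankings = len(rankings)
```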
Experiments (2)
  Updates
    Random: inserts a quantity between 1 and 50 for a random part.p_partKey
    80-20: inserts a quantity between 1 and 50 for a part.p_partKey selected according to the 80-20 rule
  N-extra
    Gap: difference between top-k and top-(k+N) scores
    100% (1*50) and 200% (2*50)
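The two update workloads can be sketched as simple generators. The 80-20 rule is interpreted here as 80% of the updates going to 20% of the part keys, with the lowest-numbered keys acting as the hot set; both choices are assumptions, as are the function names.

```python
import random

N_PARTS = 200_000  # unique part.p_partKey values
MAX_QUANTITY = 50  # inserted quantity is drawn from 1..50


def random_update(rng):
    """Uniformly random part key with a random quantity."""
    return rng.randint(1, N_PARTS), rng.randint(1, MAX_QUANTITY)


def eighty_twenty_update(rng):
    """80% of updates hit the hot 20% of part keys (assumed: the lowest keys)."""
    hot = N_PARTS // 5
    if rng.random() < 0.8:
        key = rng.randint(1, hot)
    else:
        key = rng.randint(hot + 1, N_PARTS)
    return key, rng.randint(1, MAX_QUANTITY)


rng = random.Random(0)  # fixed seed so a run is repeatable
workload = [eighty_twenty_update(rng) for _ in range(30_000)]
```

A skewed workload of this kind concentrates updates on fewer part keys than the uniform one, which is the contrast the two update types are meant to expose.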
80-20 Updates: Queries
80-20 Updates: Time
Random Updates: Queries
Random Updates: Time
Naïve Approach
  80-20 updates: 239,985 Verification Queries, 4 secs/update
  Random updates: 239,977 Verification Queries, 4 secs/update
Conclusion
  Two algorithms to maintain top-k rankings in the presence of fast updates arriving at an underlying database
  Exact top-k results
  Both are faster than the Naïve Approach; the Groups Approach further limits communication with the database
  Preliminary results that provide insight into the impact of the various parameters on the effectiveness of our methods
Thank you!
Additional Instances