Supporting Ad-Hoc Ranking Aggregates Chengkai Li (UIUC) joint work with Kevin Chang (UIUC) Ihab Ilyas (Waterloo)

Slides:



Advertisements
Similar presentations
Ranking Multimedia Databases via Relevance Feedback with History and Foresight Support / 12 I9 CHAIR OF COMPUTER SCIENCE 9 DATA MANAGEMENT AND EXPLORATION.
Advertisements

Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Web Information Retrieval
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
Supporting top-k join queries in relational databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by Rebecca M. Atchley Thursday, April.
RankSQL: Supporting Ranking Queries in RDBMS Chengkai Li (UIUC) Mohamed A. Soliman (Univ. of Waterloo) Kevin Chen-Chuan Chang (UIUC) Ihab F. Ilyas (Univ.
1 RankSQL: Query Algebra and Optimization for Relational Top-k Queries Chengkai Li (UIUC) joint work with Kevin Chen-Chuan Chang (UIUC) Ihab F. Ilyas (U.
Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.
Outline What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data.
OLAP Services Business Intelligence Solutions. Agenda Definition of OLAP Types of OLAP Definition of Cube Definition of DMR Differences between Cube and.
6/15/20151 Top-k algorithms Finding k objects that have the highest overall grades.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
Data Sources Data Warehouse Analysis Results Data visualisation Analytical tools OLAP Data Mining Overview of Business Intelligence Data visualisation.
Aggregation Algorithms and Instance Optimality
Chap8: Trends in DBMS 8.1 Database support for Field Entities 8.2 Content-based retrieval 8.3 Introduction to spatial data warehouses 8.4 Summary.
Sorting and Query Processing Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems November 29, 2005.
Minimal Probing: Supporting Expensive Predicates for Top-k Queries Kevin C. Chang Seung-won Hwang Univ. of Illinois at Urbana-Champaign.
Da Yan and Wilfred Ng The Hong Kong University of Science and Technology.
CS 345: Topics in Data Warehousing Tuesday, October 19, 2004.
Up the Odds when Gambling with Leads Modeling and Scoring Leads for Marketing in Education Presented at the DETC 2009 Workshop, Naples, FL October 20,
1 A Bayesian Method for Guessing the Extreme Values in a Data Set Mingxi Wu, Chris Jermaine University of Florida September 2007.
1 Vicky :: Cao Hui Ping Sherman :: Chow Sze Ming CTH :: Chong Tsz Ho Ronald :: Woo Lok Yan Ken :: Yiu Man Lung Implementing Data Cubes Efficiently.
1 Evaluating top-k Queries over Web-Accessible Databases Paper By: Amelie Marian, Nicolas Bruno, Luis Gravano Presented By Bhushan Chaudhari University.
« Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee » Proceedings of the 30th annual international ACM SIGIR, Amsterdam 2007) A.
Top-k Similarity Join over Multi- valued Objects Wenjie Zhang Jing Xu, Xin Liang, Ying Zhang, Xuemin Lin The University of New South Wales, Australia.
USING PREFERENCE CONSTRAINTS TO SOLVE MULTI-CRITERIA DECISION MAKING PROBLEMS Tanja Magoč, Martine Ceberio, and François Modave Computer Science Department,
Supporting Top-k join Queries in Relational Databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by: Richa Varshney.
Ranking in Information Retrieval Systems Prepared by: Mariam John CSE /23/2006.
Efficient Instant-Fuzzy Search with Proximity Ranking Authors: Inci Centidil, Jamshid Esmaelnezhad, Taewoo Kim, and Chen Li IDCE Conference 2014 Presented.
Efficient Processing of Top-k Spatial Preference Queries
1University of Texas at Arlington.  Introduction  Motivation  Requirements  Paper’s Contribution.  Related Work  Overview of Ripple Join  Rank.
Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out.
2005/12/021 Fast Image Retrieval Using Low Frequency DCT Coefficients Dept. of Computer Engineering Tatung University Presenter: Yo-Ping Huang ( 黃有評 )
Ranking objects based on relationships Computing Top-K over Aggregation Sigmod 2006 Kaushik Chakrabarti et al.
All right reserved by Xuehua Shen 1 Optimal Aggregation Algorithms for Middleware Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)
Joseph M. Hellerstein Peter J. Haas Helen J. Wang Presented by: Calvin R Noronha ( ) Deepak Anand ( ) By:
Supporting Top-k join Queries in Relational Databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by: Z. Joseph, CSE-UT Arlington.
MapReduce and Data Management Based on slides from Jimmy Lin’s lecture slides ( (licensed.
More Relation Operations 2014, Fall Pusan National University Ki-Joune Li.
Answering Top-k Queries Using Views Gautam Das (Univ. of Texas), Dimitrios Gunopulos (Univ. of California Riverside), Nick Koudas (Univ. of Toronto), Dimitris.
Effective Keyword-Based Selection of Relational Databases By Bei Yu, Guoliang Li, Karen Sollins & Anthony K. H. Tung Presented by Deborah Kallina.
Supporting Ranking and Clustering as Generalized Order-By and Group-By Chengkai Li (UIUC) joint work with Min Wang Lipyeow Lim Haixun Wang (IBM) Kevin.
Query Optimization CMPE 226 Database Systems By, Arjun Gangisetty
Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor.
Optimal Aggregation Algorithms for Middleware By Ronald Fagin, Amnon Lotem, and Moni Naor.
Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer.
Database Searching and Information Retrieval Presented by: Tushar Kumar.J Ritesh Bagga.
Answering Why-not Questions on Top-K Queries Andy He and Eric Lo The Hong Kong Polytechnic University.
G. Marchionini, Univ. of Maryland Electronic Environments Cost Trends: Hardware cost < Software cost < Information cost < People time Virtuality (transcend.
03/02/20061 Evaluating Top-k Queries Over Web-Accessible Databases Amelie Marian Nicolas Bruno Luis Gravano Presented By: Archana and Muhammed.
Lecture Access – Queries. What’s a Query? A question you ask a database –ie: “Who are my Stockton customers?” –ie: “How much did Bob sell on the 14th?”
SF-Tree and Its Application to OLAP Speaker: Ho Wai Shing.
By: Peter J. Haas and Joseph M. Hellerstein published in June 1999 : Presented By: Sthuti Kripanidhi 9/28/20101 CSE Data Exploration.
Jianping Fan Department of Computer Science University of North Carolina at Charlotte Charlotte, NC Relevance Feedback for Image Retrieval.
REX: RECURSIVE, DELTA-BASED DATA-CENTRIC COMPUTATION Yavuz MESTER Svilen R. Mihaylov, Zachary G. Ives, Sudipto Guha University of Pennsylvania.
The Database and Info. Systems Lab. University of Illinois at Urbana-Champaign RankFP : A Framework for Rank Formulation and Processing Hwanjo Yu, Seung-won.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.
1 Chengkai Li Kevin-Chen-Chuan Chang Ihab Ilyas Sumin Song Presented by: Mariam John CSE /20/2006 RankSQL: Query Algebra and Optimization for Relational.
Lecture-6 Bscshelp.com. Todays Lecture  Which Kinds of Applications Are Targeted?  Business intelligence  Search engines.
WHIM- Spring ‘10 By:-Enza Desai. What is HCIR? Study of IR techniques that brings human intelligence into search process. Coined by Gary Marchionini.
Supporting Ranking and Clustering as Generalized Order-By and Group-By
RankSQL: Query Algebra and Optimization for Relational Top-k Queries
Supporting Ad-Hoc Ranking Aggregates
RankSQL: Query Algebra and Optimization for Relational Top-k Queries
Similarity Search: A Matching Based Approach
CS222P: Principles of Data Management Notes #13 Set operations, Aggregation, Query Plans Instructor: Chen Li.
Information Retrieval and Web Design
Recommender System.
Presentation transcript:

Supporting Ad-Hoc Ranking Aggregates Chengkai Li (UIUC) joint work with Kevin Chang (UIUC) Ihab Ilyas (Waterloo)

2 Ranking (Top-k) Queries Find the top k answers with respect to a ranking function, which often is the aggregation of multiple criteria. Ranking is important in many database applications: E-Commerce Find the best hotel deals by price, distance, etc. Multimedia Databases Find the most similar images by color, shape, texture, etc. Search Engine Find the most relevant records/documents/pages. OLAP, Decision Support Find the top profitable customers to send ads.

3 RankSQL: a RDBMS with Efficient Support of Ranking Queries Rank-Aware Query Operators [SIGMOD02, VLDB03] Algebraic Foundation and Optimization Framework [SIGMOD04, SIGMOD05] SPJ queries (SELECT … FROM … WHERE … ORDER BY …) Ad-Hoc Ranking Aggregate Queries [SIGMOD06] top k groups instead of tuples. (SELECT … FROM … WHERE … GROUP BY … ORDER BY …)

4 Example 1: Advertising an insurance product What are the top 5 areas to advertise a new insurance product? SELECT zipcode, AVG(income*w1+age*w2+credit*w3) as score FROM customer WHERE occupation=‘student’ GROUP BY zipcode ORDER BY score LIMIT 5

5 Example 2: Finding the most profitable combinations What are the 5 most profitable pairs of (product category, sales area)? SELECT P.cateogy, S.zipcode, MID_SUM(S.price - P.manufact_price) as score FROM products P, sales S WHERE P.p_key=S.p_key GROUP BY P.category, S.zipcode ORDER BY score LIMIT 5

6 Ad-Hoc Ranking Ranking Condition : F=G(T) e.g. AVG (income*w1+age*w2+credit*w3) MID_SUM (S.price - P.manufact_price) G: group-aggregate function  Standard (e.g., sum, avg)  User-defined (e.g., mid_sum) T: tuple-aggregate function  arbitrary expression  e.g., AVG (income*w1+age*w2+credit*w3), w1, w2, w3 can be any values.

7 Why “Ad-Hoc”? DSS applications are exploratory and interactive: Decision makers try out various ranking criteria Results of a query as the basis for further queries It requires efficient techniques for fast response

8 Existing Techniques Data Cube / Materialized Views: pre-computation  The views may not be built for the G: e.g., mid_sum cannot be derived from sum, avg, etc.  The views may not be built for the T: e.g., a+b does not help in doing a*b, and vice versa. Materialize-Group-Sort: from the scratch

9 Materialize-Group-Sort Approach Select zipcode, AVG(income *w1+age*w2+credit*w3) as score From Customer Where occupation=‘student’ Group By zipcode Order By score Limit 5 B 5 results sorting R grouping G Boolean

10 Problems of Materialize-Group-Sort group sort (a) Traditional query plan. all tuples (materialized) all groups (materialized) Overkill: Total order of all groups, although only top 5 are requested. Inefficient: Full materialization (scan, join, grouping, sorting). Boolean operators

11 Can We Do Better? group sort (a) Traditional query plan. all tuples (materialized) all groups (materialized) Without any further info, full materialization is all that we can do. Can we do better:  What info do we need?  How to use the info? Boolean operators

12 RankAgg vs. Materialize-Group-Sort agg tuple x 1 tuple x 2 (b) New query plan. group g 1 group g 2 (incremental) rank- & group-aware operators Goal: minimize the number of tuples processed. (Partial vs. full materialization) group sort (a) Traditional query plan. all tuples (materialized) all groups (materialized) Boolean operators

13 Orders of Magnitude Performance Improvement

14 The Principles of RankAgg Can we do better? Upper-Bound Principle: best-possible goal There is a certain minimal number of tuples to retrieve before we can stop. What info do we need? Upper-Bound Principle: must-have info A non-trivial upper-bound is a must. (e.g., +infinity will not save anything.) Upper-bound of a group indicates the best a group can achieve, thus tells us if it is going to make top-k or not. How to use the info?  Group-Ranking Principle: Process the most promising group first.  Tuple-Ranking Principle: Retrieve tuples in a group in the order of T. Together: Optimal Aggregate Processing minimal number of tuples processed.

15 Running Example Select g, SUM(v) From R Group By g Order By SUM(v) Limit 1 TIDR.gR.v r1r1 1.7 r2r2 2.3 r3r3 3.9 r4r4 2.4 r5r5 1.9 r6r6 3.7 r7r7 1.6 r8r8 2.25

16 Must-Have Information Assumptions for getting a non-trivial upper-bound: We focus on a (large) class of max-bounded function: F[g] can be obtained by applying G over the maximal T of g’s members. We have the size of each group. (Will get back to this.) We can obtain the maximal value of T. (In the example, v <= 1.)

17 Example: Group-Ranking Principle TIDR.gR.v r1r1 1.7 r5r5 1.9 r7r7 1.6 TIDR.gR.v r3r3 3.9 r6r6 3.7 agg group-aware scan R TIDR.gR.v r2r2 2.3 r4r4 2.4 r8r Process the most promising group first.

18 Example: Group-Ranking Principle group-aware scan action initial R TIDR.gR.v r1r1 1.7 r5r5 1.9 r7r7 1.6 TIDR.gR.v r3r3 3.9 r6r6 3.7 TIDR.gR.v r2r2 2.3 r4r4 2.4 r8r Process the most promising group first. agg

19 Example: Group-Ranking Principle group-aware scan action initial (r 1, 1,.7) R TIDR.gR.v r5r5 1.9 r7r7 1.6 TIDR.gR.v r3r3 3.9 r6r6 3.7 TIDR.gR.v r2r2 2.3 r4r4 2.4 r8r Process the most promising group first. agg

20 Example: Group-Ranking Principle group-aware scan action initial (r 1, 1,.7) (r 2, 2,.3) R TIDR.gR.v r5r5 1.9 r7r7 1.6 TIDR.gR.v r3r3 3.9 r6r6 3.7 TIDR.gR.v r4r4 2.4 r8r Process the most promising group first. agg

21 Example: Group-Ranking Principle group-aware scan action initial (r 1, 1,.7) (r 2, 2,.3) (r 5, 1,.9) R TIDR.gR.v r7r7 1.6 TIDR.gR.v r3r3 3.9 r6r6 3.7 TIDR.gR.v r4r4 2.4 r8r Process the most promising group first. agg

22 Example: Group-Ranking Principle group-aware scan action initial (r 1, 1,.7) (r 2, 2,.3) (r 5, 1,.9) (r 7, 1,.6) R TIDR.gR.v TIDR.gR.v r3r3 3.9 r6r6 3.7 TIDR.gR.v r4r4 2.4 r8r Process the most promising group first. agg

23 Example: Group-Ranking Principle group-aware scan action initial (r 1, 1,.7) (r 2, 2,.3) (r 5, 1,.9) (r 7, 1,.6) (r 4, 2,.4) R TIDR.gR.v TIDR.gR.v r3r3 3.9 r6r6 3.7 TIDR.gR.v r8r Process the most promising group first. agg

24 Example: Tuple-Ranking Principle TIDR.gR.v r2r2 2.3 r4r4 2.4 r8r TIDR.gR.v r4r4 2.4 r2r2 2.3 r8r group & rank-aware scan R in the order of R.v not in the order of R.v Retrieve tuples within a group in the order of tuple-aggregate function T. agg

25 Example: Tuple-Ranking Principle TIDR.gR.v r2r2 2.3 r4r4 2.4 r8r TIDR.gR.v r4r4 2.4 r2r2 2.3 r8r group & rank-aware scan R in the order of R.v not in the order of R.v Retrieve tuples within a group in the order of tuple-aggregate function T. action initial3.0 action initial3.0 agg

26 Example: Tuple-Ranking Principle TIDR.gR.v r4r4 2.4 r8r TIDR.gR.v r2r2 2.3 r8r group & rank-aware scan R in the order of R.v not in the order of R.v Retrieve tuples within a group in the order of tuple-aggregate function T. action initial3.0 (r 2, 2,.3)2.3 action initial3.0 (r 4, 2,.4)1.2 agg

27 Example: Tuple-Ranking Principle TIDR.gR.v r8r TIDR.gR.v r8r group & rank-aware scan R in the order of R.v not in the order of R.v Retrieve tuples within a group in the order of tuple-aggregate function T. action initial3.0 (r 2, 2,.3)2.3 (r 4, 2,.4)1.7 action initial3.0 (r 4, 2,.4)1.2 (r 2, 2,.3)1.0 agg

28 Implementing the Principles: Obtaining Group Size Sizes ready: Though G(T) is ad-hoc, the Boolean conditions are shared in sessions of decision making. Sizes from materialized information: Similar queries computed. Sizes from scratch: Pay as much as materialize-group-sort for the 1 st query; amortized by the future similar queries.

29 Implementing the Principles: Group-Aware Plans Current iterator GetNext( ) operator New iterator GetNext(g) operator scan agg GetNext(g) GetNext(g’’) GetNext(g’)

30 Conclusions Ranking Aggregate Queries  Top-k groups  Ad-Hoc ranking conditions RankAgg  Principles Upper-Bound, Group-Ranking, and Tuple-Ranking  Optimal Aggregate Processing Minimal number of tuples processed  Significant performance gains, compared with materialize-group-sort.