RankSQL: Query Algebra and Optimization for Relational Top-k Queries

Slides:



Advertisements
Similar presentations
Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted.
Advertisements

Web Information Retrieval
Efficient Processing of Top- k Queries in Uncertain Databases Ke Yi, AT&T Labs Feifei Li, Boston University Divesh Srivastava, AT&T Labs George Kollios,
13/04/20151 SPARK: Top- k Keyword Query in Relational Database Wei Wang University of New South Wales Australia.
Twig 2 Stack: Bottom-up Processing of Generalized-Tree-Pattern Queries over XML Documents Songting Chen, Hua-Gang Li *, Junichi Tatemura Wang-Pin Hsiung,
Efficient IR-Style Keyword Search over Relational Databases Vagelis Hristidis University of California, San Diego Luis Gravano Columbia University Yannis.
 Introduction  Views  Related Work  Preliminaries  Problems Discussed  Algorithm LPTA  View Selection Problem  Experimental Results.
Supporting top-k join queries in relational databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by Rebecca M. Atchley Thursday, April.
RankSQL: Supporting Ranking Queries in RDBMS Chengkai Li (UIUC) Mohamed A. Soliman (Univ. of Waterloo) Kevin Chen-Chuan Chang (UIUC) Ihab F. Ilyas (Univ.
1 Autocompletion for Mashups Ohad Greenshpan, Tova Milo, Neoklis Polyzotis Tel-Aviv University UCSC.
1 RankSQL: Query Algebra and Optimization for Relational Top-k Queries Chengkai Li (UIUC) joint work with Kevin Chen-Chuan Chang (UIUC) Ihab F. Ilyas (U.
The Volcano/Cascades Query Optimization Framework
CS 540 Database Management Systems
Evaluation of Relational Operators CS634 Lecture 11, Mar Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Top-k Set Similarity Joins Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang University of New South Wales and NICTA.
University of Washington Database Group Tiresias The Database Oracle for How-To Queries Alexandra Meliou § ✜ Dan Suciu ✜ § University of Massachusetts.
1 NNH: Improving Performance of Nearest- Neighbor Searches Using Histograms Liang Jin (UC Irvine) Nick Koudas (AT&T Labs Research) Chen Li (UC Irvine)
Supporting Ad-Hoc Ranking Aggregates Chengkai Li (UIUC) joint work with Kevin Chang (UIUC) Ihab Ilyas (Waterloo)
1 EntityRank: Searching Entities Directly and Holistically Tao Cheng Joint work with : Xifeng Yan, Kevin Chang VLDB 2007, Vienna, Austria.
Web- and Multimedia-based Information Systems. Assessment Presentation Programming Assignment.
Efficient Processing of Top-k Spatial Keyword Queries João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg 1 SSTD 2011.
6/15/20151 Top-k algorithms Finding k objects that have the highest overall grades.
FlexPref: A Framework for Extensible Preference Evaluation in Database Systems Justin J. Levandoski Mohamed F. Mokbel Mohamed E. Khalefa.
CSE 636 Data Integration Limited Source Capabilities Slides by Hector Garcia-Molina.
Depth Estimation for Ranking Query Optimization Karl Schnaitter, UC Santa Cruz Joshua Spiegel, BEA Systems, Inc. Neoklis Polyzotis, UC Santa Cruz.
Evaluating Top-k Queries over Web-Accessible Databases Nicolas Bruno Luis Gravano Amélie Marian Columbia University.
1 Query Planning with Limited Source Capabilities Chen Li Stanford University Edward Y. Chang University of California, Santa Barbara.
CPS216: Advanced Database Systems Notes 03:Query Processing (Overview, contd.) Shivnath Babu.
Minimal Probing: Supporting Expensive Predicates for Top-k Queries Kevin C. Chang Seung-won Hwang Univ. of Illinois at Urbana-Champaign.
NiagaraCQ : A Scalable Continuous Query System for Internet Databases (modified slides available on course webpage) Jianjun Chen et al Computer Sciences.
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
PageRank for Product Image Search Kevin Jing (Googlc IncGVU, College of Computing, Georgia Institute of Technology) Shumeet Baluja (Google Inc.) WWW 2008.
CSE 636 Data Integration Limited Source Capabilities Slides by Hector Garcia-Molina Fall 2006.
1 Evaluating top-k Queries over Web-Accessible Databases Paper By: Amelie Marian, Nicolas Bruno, Luis Gravano Presented By Bhushan Chaudhari University.
Querying Structured Text in an XML Database By Xuemei Luo.
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
C-Store: How Different are Column-Stores and Row-Stores? Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 8, 2009.
1University of Texas at Arlington.  Introduction  Motivation  Requirements  Paper’s Contribution.  Related Work  Overview of Ripple Join  Rank.
Spatial Query Processing Spatial DBs do not have a set of operators that are considered to be basic elements in a query evaluation. Spatial DBs handle.
Ranking objects based on relationships Computing Top-K over Aggregation Sigmod 2006 Kaushik Chakrabarti et al.
All right reserved by Xuehua Shen 1 Optimal Aggregation Algorithms for Middleware Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)
IO-Top-k: Index-access Optimized Top-k Query Processing Debapriyo Majumdar Max-Planck-Institut für Informatik Saarbrücken, Germany Joint work with Holger.
Supporting Top-k join Queries in Relational Databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by: Z. Joseph, CSE-UT Arlington.
Effective Keyword-Based Selection of Relational Databases By Bei Yu, Guoliang Li, Karen Sollins & Anthony K. H. Tung Presented by Deborah Kallina.
Supporting Ranking and Clustering as Generalized Order-By and Group-By Chengkai Li (UIUC) joint work with Min Wang Lipyeow Lim Haixun Wang (IBM) Kevin.
Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer.
Database Searching and Information Retrieval Presented by: Tushar Kumar.J Ritesh Bagga.
03/02/20061 Evaluating Top-k Queries Over Web-Accessible Databases Amelie Marian Nicolas Bruno Luis Gravano Presented By: Archana and Muhammed.
Query by Image and Video Content: The QBIC System M. Flickner et al. IEEE Computer Special Issue on Content-Based Retrieval Vol. 28, No. 9, September 1995.
The Database and Info. Systems Lab. University of Illinois at Urbana-Champaign Light-weight Domain-based Form Assistant: Querying Web Databases On the.
Boolean + Ranking: Querying a Database by K-Constrained Optimization Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi.
1 Chengkai Li Kevin-Chen-Chuan Chang Ihab Ilyas Sumin Song Presented by: Mariam John CSE /20/2006 RankSQL: Query Algebra and Optimization for Relational.
Supporting Ranking and Clustering as Generalized Order-By and Group-By
Boolean + Ranking: Querying a Database by K-Constrained Optimization
RankSQL: Query Algebra and Optimization for Relational Top-k Queries
Supporting Ranking in Queries Score-based Paradigm
Seung-won Hwang, Kevin Chen-Chuan Chang
Supporting Ad-Hoc Ranking Aggregates
Prepared by : Ankit Patel (226)
CS222P: Principles of Data Management Lecture #15 Query Optimization (System-R) Instructor: Chen Li.
Preference Query Evaluation Over Expensive Attributes
Junqi Zhang+ Xiangdong Zhou+ Wei Wang+ Baile Shi+ Jian Pei*
Renouncing Hotel’s Data Through Queries Using Hadoop
Query Optimization CS 157B Ch. 14 Mien Siao.
Probabilistic Databases
CS222P: Principles of Data Management Notes #13 Set operations, Aggregation, Query Plans Instructor: Chen Li.
CS222: Principles of Data Management Lecture #15 Query Optimization (System-R) Instructor: Chen Li.
Completing the Physical-Query-Plan and Chapter 16 Summary ( )
Efficient Processing of Top-k Spatial Preference Queries
Liang Jin (UC Irvine) Nick Koudas (AT&T Labs Research)
Presentation transcript:

RankSQL: Query Algebra and Optimization for Relational Top-k Queries Chengkai Li (UIUC) joint work with Kevin Chen-Chuan Chang (UIUC) Ihab F. Ilyas (U. of Waterloo) Sumin Song (UIUC)

Ranking (Top-k) Queries Ranking is an important functionality in many real-world database applications: E-Commerce, Web Sources Find the best hotel deals by price, distance, etc. Multimedia Databases Find the most similar images by color, shape, texture, etc. Text Retrieval, Search Engine Find the most relevant records/documents/pages. OLAP, Decision Support Find the top profitable customers to send ads.

Example: Trip Planning Suggest a hotel to stay and a museum to visit: hotel museum cheap close related score h1 m2 0.9 0.7 0.8 2.4 h2 m1 0.6 2.3 m3 2.2 Select * From Hotel h, Museum m Where h.star=3 AND h.area=m.area Order By cheap(h.price) + close(h.addr, “BWI airport”) + related(m.collection,“dinosaur”) Limit 5 membership dimension: Boolean predicates, Boolean function B order dimension: ranking predicates, monotonic scoring function R

Processing Ranking Queries in Traditional RDBMS 5 results sort cheap+close+related R Select * From Hotel h, Museum m Where B Order By R Limit 5 B h.area=m.area σh.star=3 scan(h) scan(m)

Problems of Traditional Approach Naïve Materialize-then-Sort scheme Overkill: total order of all results; only 5 top results are requested. Very inefficient: Scan large base tables; Join large intermediate results; Evaluate every ranking on every tuple; Full sorting. R 5 results B

Therefore the problem is: Unlike Boolean constructs, ranking is second class. Ranking is processed as a Monolithic component (R ), always after the Boolean component (B).

How did we make Boolean “first class”? Select * From Hotel h, Museum m Where h.star=3 AND h.area=m.area σ h.star=3 AND h.area=m.area Hotel X Museum (1) materialize-then-filter

First Class: Splitting and Interleaving Select * From Hotel h, Museum m Where h.star=3 AND h.area=m.area σ h.star=3 AND h.area=m.area Hotel X Museum (1) materialize-then-filter h.area=m.area σh.star=3 scan(h) scan(m) (2) B is split into joins and selections, which interleave with each other.

sort cheap+close+related Ranking Query Plan 5 results sort cheap+close+related R 5 results rank-joinh.area=m.area σh.star=3 rank-scan(h,cheap) scan(m) μrelated μclose B h.area=m.area σh.star=3 scan(h) scan(m) split and interleave: reduction of intermediate results, thus processing cost materialize-then-sort: naïve, overkill

Possibly orders of magnitude improvement Implementation in PostgreSQL plan1: traditional materialize-then-sort plan plan2-4: new ranking query plans Observations: an extended plan space with plans of various costs.

RankSQL Goals: Support ranking as a first-class query type in RDBMS; splitting ranking. Integrate ranking with traditional Boolean query constructs. interleaving ranking with other operations. Foundation: Rank-Relational Algebra data model: rank-relation operators: new and augmented algebraic laws Query engine: executor: physical operator implementation optimizer: plan enumeration, cost estimation

Two Logical Properties of Rank-Relation Membership of the tuples: evaluated Boolean predicates Order among the tuples: evaluated ranking predciates rank-relation Membership(B): h.area=m.area, h.star=3 Order(R): close(h.addr, “BWI airport”) related(m.collection,“dinosaur”) cheap(h.price) B Membership(B): h.star=3 Order(R): cheap(h.price) A rank-relation

Ranking Principle: what should be the order? F=cheap + close + related hotel cheap h1 0.9 h2 0.6 … upper bound museum close related 2.9 * 1 2.6 … B tuple h1 tuple h2 Upper-bound determines the order: Without further processing h1, we cannot output any result; A

Ranking Principle: upper-bound determines the order F=cheap + close + related hotel cheap h1 0.9 h2 0.6 … upper bound museum close related 2.9 * 1 2.6 … B tuple h1 tuple h2 Upper-bound determines the order: Without further processing h1, we cannot output any result; A Processing in the “promising” order, avoiding unnecessary processing.

Rank-Relation Rank-relation R: relation F: monotonic scoring function over predicates (p1, ..., pn) P  {p1, . . . , pn}: evaluated predicates Logical Properties: Membership: R (as usual) Order: <  t1, t2  : t1 < t2 iff FP [t1] < FP [t2]. (by upper-bound)

Operators To achieve splitting and interleaving: New operator: μ: evaluate ranking predicates piece by piece. implementation: MPro (Chang et al. SIGMOD02). Extended operators: rank-selection rank-join implementation: HRJN (Ilyas et al. VLDB03). rank-scan rank-union, rank-intersection.

Example μp3 μp2 Rp1+p2+p3 Rp1+p2 Select * From Hotel H Rp1 upper-bound hotel hotel … p1 p2 p3 score h1 0.7 0.8 0.9 2.4 h2 0.85 2.55 h3 0.5 0.45 0.75 1.7 h4 0.4 0.95 2.05 Scanp1(H) μp2 μp3 Rp1+p2 Select * From Hotel H Order By p1+p2+p3 Limit 1 Rp1

Example μp3 μp2 Rp1+p2+p3 Rp1+p2 Select * From Hotel H Rp1 Scanp1(H) μp2 μp3 Rp1 Rp1+p2 Rp1+p2+p3 hotel upper-bound hotel … p1 p2 p3 score h1 h2 0.9 h3 h4 hotel upper-bound Select * From Hotel H Order By p1+p2+p3 Limit 1 hotel upper-bound h2

Example μp3 μp2 Rp1+p2+p3 Rp1+p2 Select * From Hotel H Rp1 Scanp1(H) μp2 μp3 Rp1 Rp1+p2 Rp1+p2+p3 hotel upper-bound hotel … p1 p2 p3 score h1 0.9 1.0 2.9 h2 h3 h4 hotel upper-bound Select * From Hotel H Order By p1+p2+p3 Limit 1 hotel upper-bound h2 2.9

Example μp3 μp2 Rp1+p2+p3 Rp1+p2 Select * From Hotel H Rp1 Scanp1(H) μp2 μp3 Rp1 Rp1+p2 Rp1+p2+p3 hotel upper-bound hotel … p1 p2 p3 score h1 0.9 1.0 2.9 h2 h3 h4 hotel upper-bound h2 2.9 Select * From Hotel H Order By p1+p2+p3 Limit 1 hotel upper-bound h2 2.9

Example μp3 μp2 Rp1+p2+p3 Rp1+p2 Select * From Hotel H Rp1 Scanp1(H) μp2 μp3 Rp1 Rp1+p2 Rp1+p2+p3 hotel upper-bound hotel … p1 p2 p3 score h1 0.9 1.0 2.9 h2 0.85 2.75 h3 h4 hotel upper-bound h2 2.75 Select * From Hotel H Order By p1+p2+p3 Limit 1 hotel upper-bound h2 2.9

Example μp3 μp2 Rp1+p2+p3 Rp1+p2 Select * From Hotel H Rp1 Scanp1(H) μp2 μp3 Rp1 Rp1+p2 Rp1+p2+p3 hotel upper-bound hotel … p1 p2 p3 score h1 0.7 1.0 2.7 h2 0.9 0.85 2.75 h3 h4 hotel upper-bound h2 2.75 Select * From Hotel H Order By p1+p2+p3 Limit 1 hotel upper-bound h2 2.9 h1 2.7

Example μp3 μp2 Rp1+p2+p3 Rp1+p2 Select * From Hotel H Rp1 Scanp1(H) μp2 μp3 Rp1 Rp1+p2 Rp1+p2+p3 hotel upper-bound hotel … p1 p2 p3 score h1 0.7 0.8 1.0 2.5 h2 0.9 0.85 2.75 h3 2.7 h4 hotel upper-bound h2 2.75 h1 2.5 Select * From Hotel H Order By p1+p2+p3 Limit 1 hotel upper-bound h2 2.9 h1 2.7

Example μp3 μp2 Rp1+p2+p3 Rp1+p2 Select * From Hotel H Rp1 Scanp1(H) μp2 μp3 Rp1 Rp1+p2 Rp1+p2+p3 hotel upper-bound h2 2.55 hotel … p1 p2 p3 score h1 0.7 0.8 1.0 2.5 h2 0.9 0.85 2.55 h3 2.7 h4 hotel upper-bound h2 2.75 h1 2.5 Select * From Hotel H Order By p1+p2+p3 Limit 1 hotel upper-bound h2 2.9 h1 2.7

Example μp3 μp2 Rp1+p2+p3 Rp1+p2 Select * From Hotel H Rp1 Scanp1(H) μp2 μp3 Rp1 Rp1+p2 Rp1+p2+p3 hotel upper-bound h2 2.55 hotel … p1 p2 p3 score h1 0.7 0.8 1.0 2.5 h2 0.9 0.85 2.55 h3 0.5 h4 hotel upper-bound h2 2.75 h1 2.5 Select * From Hotel H Order By p1+p2+p3 Limit 1 hotel upper-bound h2 2.9 h1 2.7 h3 2.5

Example μp3 μp2 Rp1+p2+p3 Rp1+p2 Select * From Hotel H Rp1 Scanp1(H) μp2 μp3 Rp1 Rp1+p2 Rp1+p2+p3 hotel upper-bound h2 2.55 hotel … p1 p2 p3 score h1 0.7 0.8 1.0 2.5 h2 0.9 0.85 2.55 h3 0.5 0.45 1.95 h4 hotel upper-bound h2 2.75 h1 2.5 h3 1.95 Select * From Hotel H Order By p1+p2+p3 Limit 1 hotel upper-bound h2 2.9 h1 2.7 h3 2.5

Example μp3 μp2 Rp1+p2+p3 Rp1+p2 Select * From Hotel H Rp1 Scanp1(H) μp2 μp3 Rp1 Rp1+p2 Rp1+p2+p3 hotel upper-bound h2 2.55 h1 2.4 hotel … p1 p2 p3 score h1 0.7 0.8 0.9 2.4 h2 0.85 2.55 h3 0.5 0.45 1.0 1.95 h4 2.5 hotel upper-bound h2 2.75 h1 2.5 h3 1.95 Select * From Hotel H Order By p1+p2+p3 Limit 1 hotel upper-bound h2 2.9 h1 2.7 h3 2.5

Example μp3 μp2 Rp1+p2+p3 Rp1+p2 Select * From Hotel H Rp1 Scanp1(H) μp2 μp3 Rp1 Rp1+p2 Rp1+p2+p3 hotel upper-bound h2 2.55 h1 2.4 hotel … p1 p2 p3 score h1 0.7 0.8 0.9 2.4 h2 0.85 2.55 h3 0.5 0.45 1.0 1.95 h4 2.5 hotel upper-bound h2 2.75 h1 2.5 h3 1.95 Select * From Hotel H Order By p1+p2+p3 Limit 1 hotel upper-bound h2 2.9 h1 2.7 h3 2.5

In contrast: materialize-then-sort hotel … p1 p2 p3 score h1 0.7 0.8 0.9 2.4 h2 0.85 2.55 h3 0.5 0.45 0.75 1.7 h4 0.4 0.95 2.05 …. Sortp1+p2+p3 Select * From Hotel H Order By p1+p2+p3 Limit 1 Scan(H)

Impact of Rank-Relational Algebra parser, rewriter Ranking Query operators and algebraic laws Logical Query Plan cost model Plan enumerator optimizer Physical Query Plan implementation of operators executor Results

Optimization Two-dimensional enumeration: ranking (ranking predicate scheduling) and filtering (join order selection) Sampling-based cardinality estimation

Two-Dimensional Enumeration (1 table, 0 predicate) seqScan(H), idxScan(H), seqScan(M), … (1 table, 1 predicate) rankScancheap(H), cheap(seqScan(H)), … (1 table, 2 predicates) close(rankScancheap(H)), … (2 table, 0 predicate) NestLoop(seqScan(H), seqScan(M)), … (2 table, 1 predicate) NRJN(rankScancheap(H), seqScan(M)),… and so on…

Related Work Middleware Fagin et al. (PODS 96,01),Nepal et al. (ICDE 99),Günter et al. (VLDB 00),Bruno et al. (ICDE 02),Chang et al. (SIGMOD 02) RDBMS, outside the core Chaudhuri et al. (VLDB 99),Chang et al. (SIGMOD 00), Hristidis et al. (SIGMOD 01), Tsaparas et al. (ICDE 03), Yi et al. (ICDE 03) RDBMS, in the query engine Physical operators and physical properties Carey et al. (SIGMOD 97), Ilyas et al. (VLDB 02, 03, SIGMOD 04), Natsev et al. (VLDB 01) Algebra framework Chaudhuri et al. (CIDR 05)

Conclusion: RankSQL System Goal: Support ranking as a first-class query type; Integrate ranking with Boolean query constructs. Our approach: Algebra: rank-relation, new and augmented rank-aware operators, algebraic laws Optimizer: two-dimensional enumeration, sampling-based cost estimation Implementation: in PostgreSQL Welcome to our demo in VLDB05!