A Framework for Testing Query Transformation Rules

Presentation transcript:

A Framework for Testing Query Transformation Rules
Hicham Elmongui (Purdue University)
Vivek Narasayya, Ravi Ramamurthy (Microsoft Research)
4/10/2019 ACM SIGMOD 2009

Query Optimizer
- The optimizer is the database system component responsible for producing a good execution plan for a given SQL query
- Crucial for decision support queries

Query Optimizer Components
- Search Strategy
- Rule Engine (applies rules)
- Cost Model
- Cardinality Estimation
[Figure: a query enters the query optimizer, which produces an execution plan]

Query Transformation Rules
- Logical rule: e.g. the Join Associativity rule
- Implementation rule: e.g. the Join To Hash Join rule, which turns a join into a hash join
- Search space is extensible by adding new rules: Group By, De-correlation, Star Join, etc.
- Modern optimizers have a large number of rules
[Figure: join trees over R and S before and after applying each rule]

Implementing Rule Engine is Non-Trivial
Original (correlated) query:
    SELECT D.Name FROM DEPT D WHERE D.Budget <= (SELECT COUNT(E.Eno)*10000 FROM EMP E WHERE E.Dno = D.Dno)
Rewritten (de-correlated) query:
    SELECT D.Name FROM DEPT D, EMP E WHERE D.Dno = E.Dno GROUP BY D.Name HAVING D.Budget <= COUNT(E.Eno)*10000
- Count bug in de-correlation: a department with no employees can qualify in the original query (its count is 0) but never appears in the join of the rewritten query
- Rewrite rules can be subtle
- Implementation errors can lead to incorrect results
- RAGS paper (VLDB’98): 4 DBMSs disagreed on query results 16% of the time!
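The count bug can be reproduced on any SQL engine. The following is a minimal sketch using Python's built-in sqlite3 module; the table contents are made up for illustration, and D.Budget is added to the GROUP BY list so the rewritten query does not rely on engine-specific handling of ungrouped columns.

```python
import sqlite3

# Illustrative data (an assumption, not from the slides): one department
# with an employee and one department with a zero budget and no employees.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DEPT (Dno INTEGER PRIMARY KEY, Name TEXT, Budget INTEGER);
    CREATE TABLE EMP  (Eno INTEGER PRIMARY KEY, Dno INTEGER);
    INSERT INTO DEPT VALUES (1, 'Sales', 5000);
    INSERT INTO DEPT VALUES (2, 'Archive', 0);
    INSERT INTO EMP  VALUES (10, 1);
""")

CORRELATED = """
    SELECT D.Name FROM DEPT D
    WHERE D.Budget <= (SELECT COUNT(E.Eno) * 10000
                       FROM EMP E WHERE E.Dno = D.Dno)
    ORDER BY D.Name
"""

DECORRELATED = """
    SELECT D.Name FROM DEPT D, EMP E
    WHERE D.Dno = E.Dno
    GROUP BY D.Name, D.Budget
    HAVING D.Budget <= COUNT(E.Eno) * 10000
    ORDER BY D.Name
"""

def names(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]

correlated_rows = names(CORRELATED)
decorrelated_rows = names(DECORRELATED)
print(correlated_rows)    # includes the department without employees
print(decorrelated_rows)  # the join drops that department
```

The two result sets differ only because of the department with no employees, which is exactly the case the naive rewrite loses.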

Testing Optimizer Rule Engine
- Coverage: Is a given rule (or set of rules) exercised?
- Correctness: Does exercising a rule (or set of rules) change the query results?
- Performance: How does a rule (or set of rules) affect query performance?

Rule Coverage
- API to track which rules are exercised for a given query
[Figure: for each query Q1 … Qm, the subset of transformation rules 1 … n that the query exercises]
- Definitions of when a rule is exercised:
  - Rule must generate at least one expression during optimization
  - At least one expression in the final plan must be generated by the rule
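The two definitions of "exercised" can be contrasted with a small sketch. The trace format below (a mapping from rule name to the set of expression ids it generated, plus the set of expression ids in the final plan) is a hypothetical stand-in for whatever the optimizer's tracking API returns.

```python
def exercised_during_optimization(rule, generated_by):
    """Weaker definition: the rule generated at least one expression."""
    return len(generated_by.get(rule, ())) > 0

def exercised_in_final_plan(rule, generated_by, final_plan_exprs):
    """Stronger definition: some expression in the final plan came from the rule."""
    return any(expr in final_plan_exprs for expr in generated_by.get(rule, ()))

# Hypothetical optimization trace for one query.
trace = {
    "JoinCommutativity": {"e1", "e2"},
    "JoinToHashJoin": {"e3"},
    "GroupByPullUp": set(),
}
final_plan = {"e0", "e3"}

weak = {r: exercised_during_optimization(r, trace) for r in trace}
strong = {r: exercised_in_final_plan(r, trace, final_plan) for r in trace}
print(weak)
print(strong)
```

In this made-up trace, JoinCommutativity counts as exercised under the first definition only, because none of its expressions survived into the final plan.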

Testing Rule Coverage
- Generate a query such that each rule is exercised
  - Hard to precisely characterize when a rule will be exercised
  - Depends on rule semantics, optimizer heuristics, etc.
- Extend to a set of rules (e.g. rule pairs)
  - Large space of combinations
- Efficient query generation
  - Time required to generate a query that exercises a rule should be as small as possible
  - Need multiple queries per rule (or set of rules)
  - Random query generation can be inefficient

Rule Correctness
- Optimize and execute query Q with all rules enabled: plan P, results R
- Disable rule r2 (one of the transformation rules exercised by Q), then optimize and execute Q again: plan P', results R'
- R ≠ R' indicates a bug
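The comparison described on this slide can be sketched as a small differential check. The optimize and execute callables, the rule names, and the toy engine below are all hypothetical; a real harness would call the DBMS under test instead.

```python
from collections import Counter

def rule_is_consistent(query, rule, optimize, execute):
    """Compare results of `query` with `rule` enabled and disabled.

    optimize(query, disabled_rules) -> plan and execute(plan) -> rows are
    assumed interfaces. Returns False when the results differ (a bug).
    """
    plan = optimize(query, frozenset())
    plan_off = optimize(query, frozenset({rule}))
    if plan == plan_off:
        return True  # identical plans, so execution cannot differ
    # Compare as multisets: row order may legitimately differ between plans.
    return Counter(execute(plan)) == Counter(execute(plan_off))

# Toy engine (illustration only): disabling one rule changes the plan,
# and the two plans return different rows.
def toy_optimize(query, disabled):
    return "nested_loop_join" if "JoinToHashJoin" in disabled else "hash_join"

TOY_ROWS = {"hash_join": [(1,), (2,)], "nested_loop_join": [(2,), (1,), (3,)]}

def toy_execute(plan):
    return TOY_ROWS[plan]

buggy = not rule_is_consistent("Q", "JoinToHashJoin", toy_optimize, toy_execute)
unaffected = rule_is_consistent("Q", "SomeOtherRule", toy_optimize, toy_execute)
print(buggy, unaffected)
```

A mismatch only says that the two plans disagree; deciding which of them is wrong still needs manual inspection.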

Testing Rule Correctness
[Figure: query Q is optimized to plan P; disabling each exercised rule in turn (r2, r3, …, rn-1) and re-optimizing yields one alternative plan per disabled rule]
- For each rule, repeat for multiple such queries (k)
- Need to execute if P ≠ P'
- Queries are usually complex
- Equivalence of plans P and P' cannot be inferred in most cases
- Time consuming

DBMS Testing
Data Generation
- Quickly generating Billion-Record databases (SIGMOD’94)
- Flexible Database Generators (VLDB’05)
- Reverse Query Processing (ICDE’07)
- MUDD: A Multi-dimensional data generator (WOSP’04)
Query Generation
- RAGS (VLDB’98)
- Generating Thousand Benchmark Queries in Seconds (VLDB’04)
- Genetic approach (VLDB’07)
- Unit testing query transformation rules (DBTest’08)
- Generating queries with cardinality constraints (TKDE’06, SIGMOD’08)

Query Generation for Rule Testing
RAGS (VLDB’98)
- Stochastic SQL statement generation
- Control SQL generated via configuration parameters: #joins, #columns in Group-By, max sub-query depth, …
Genetic approach (VLDB’07)
- Queries are mutated, combined, etc. to generate new queries
- Feedback function applied on each query to determine “fitness”, e.g. prefer queries with non-empty results

Our Contributions
Query generation
- Exploit “rule patterns” to identify necessary conditions for a rule to be exercised
- Significantly reduces number of trials compared to previous approaches
Correctness validation
- Novel problem of test suite compression
- Significantly reduces time for correctness testing
- Shown to be NP-Hard
- Principled solution (factor 2 approximation)

QREL Framework
- QREL (DBTest’08): programming framework for generating queries
- Generate logical query tree from tree “pattern”
- Generate SQL from a given logical query tree

Architecture
[Figure: system architecture diagram]

Rule Patterns
- Rule = (Rule Name, Rule Pattern, Substitution)
- Given an input expression e: if e matches Rule Pattern, generate a new expression by invoking the Substitution function on e
[Figure: rule pattern for Join Commutativity, and the rule applied to a join tree over R, S, T]
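A rule as a (name, pattern, substitution) triple can be sketched in a few lines. The Node and Rule classes and the wildcard convention below are illustrative assumptions, not the optimizer's actual data structures.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Node:
    op: str
    children: tuple = ()

ANY = Node("*")  # wildcard: matches any sub-expression

def matches(expr: Node, pattern: Node) -> bool:
    """True if `expr` has the shape described by `pattern`."""
    if pattern.op == "*":
        return True
    return (expr.op == pattern.op
            and len(expr.children) == len(pattern.children)
            and all(matches(c, p) for c, p in zip(expr.children, pattern.children)))

@dataclass(frozen=True)
class Rule:
    name: str
    pattern: Node
    substitute: Callable[[Node], Node]

    def apply(self, expr: Node) -> Optional[Node]:
        """Return the substituted expression, or None if the pattern does not match."""
        return self.substitute(expr) if matches(expr, self.pattern) else None

# Join commutativity: Join(left, right) -> Join(right, left).
join_commutativity = Rule(
    name="JoinCommutativity",
    pattern=Node("Join", (ANY, ANY)),
    substitute=lambda e: Node("Join", (e.children[1], e.children[0])),
)

R, S = Node("Get R"), Node("Get S")
swapped = join_commutativity.apply(Node("Join", (R, S)))
no_match = join_commutativity.apply(R)
print(swapped, no_match)
```

Matching the pattern is only a necessary condition here: a real optimizer may still skip the rule because of heuristics or cost-based pruning.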

Exposing Rule Patterns
- Idea: the optimizer exposes a Rule Pattern for a given rule
- Returns (a subset of) necessary conditions for the rule to be exercised
- Encoded using XML in our implementation
[Figure: the query generation tool asks the query optimizer in the DBMS for the pattern of a rule, e.g. “Join Commutativity”]

Rule Interactions
- Bugs in implementation of one rule may manifest when another rule is also applied
[Figure: two plans for Get R joined with Get S on R.a = S.b; in each, the “Get to Index Scan” rule is applied to Get S and the “Join to Merge Join” rule to the join; one plan uses Index Scan I (a, d), the other Index Scan I (d, a)]

Rule Composition
- Combine rule patterns by replacing a wildcard node with the other rule pattern
- Other kinds of composition possible as well
[Figure: rule pattern for Pulling GB above Join and rule pattern for Join Commutativity, each with wildcard nodes, combined into composite patterns containing Group-By]
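Wildcard replacement can be sketched with patterns written as nested tuples. The exact shape of the "pull Group-By above Join" pattern below is an assumption made for illustration; only the composition mechanism is taken from the slide.

```python
WILDCARD = "*"

def compositions(outer, inner):
    """Yield each pattern obtained by replacing one wildcard in `outer` with `inner`.

    Patterns are either WILDCARD or tuples of the form (operator, child, ...).
    """
    if outer == WILDCARD:
        yield inner
        return
    op, *children = outer
    for i, child in enumerate(children):
        for replaced in compositions(child, inner):
            yield (op, *children[:i], replaced, *children[i + 1:])

# Assumed pattern shapes (hypothetical).
join_commutativity = ("Join", WILDCARD, WILDCARD)
pull_gb_above_join = ("Join", ("GroupBy", WILDCARD), WILDCARD)

composed = list(compositions(pull_gb_above_join, join_commutativity))
for pattern in composed:
    print(pattern)
```

Each composed pattern is a candidate shape for a query that could exercise both rules, which is why a pattern with several wildcards yields several candidates.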

Query Generation Algorithm
For each rule pair (r1, r2):
  1. Select a composition of rule patterns
  2. T = generate logical query tree for the composed rule pattern
  3. S = generate SQL statement for T (using QREL)
  4. Repeat if r1 and r2 are not exercised when S is optimized
[Figure: logical query tree with Group-By nodes over T1, T2, T3]
Example SQL: SELECT T3.a, … FROM T1, T2, T3 WHERE … GROUP BY T3.a, …
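The generate-and-check loop for one rule pair can be sketched as follows. The callables to_sql (standing in for QREL) and exercised_rules (standing in for the optimizer's rule-tracking API), the stub behavior, and the trial cap are all hypothetical.

```python
import itertools

def generate_query_for_pair(r1, r2, candidate_trees, to_sql, exercised_rules,
                            max_trials=100):
    """Try candidate logical trees until one exercises both rules.

    Returns (sql, trials) on success and (None, trials) if the cap is hit.
    """
    trials = 0
    for tree in itertools.islice(candidate_trees, max_trials):
        trials += 1
        sql = to_sql(tree)
        if {r1, r2} <= exercised_rules(sql):
            return sql, trials
    return None, trials

# Stubs for illustration only.
trees = iter(["Join(GroupBy(T1), T2)", "GroupBy(Join(Join(T1, T2), T3))"])

def stub_to_sql(tree):
    return "-- SQL for " + tree

def stub_exercised(sql):
    if "T3" in sql:
        return {"JoinCommutativity", "GroupByPullUp"}
    return {"JoinCommutativity"}

sql, trials = generate_query_for_pair(
    "GroupByPullUp", "JoinCommutativity", trees, stub_to_sql, stub_exercised)
print(trials, sql)
```

The number of trials is the quantity the experiments compare: better candidate trees mean fewer iterations of this loop.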

Experiments
- Number of trials significantly fewer using Rule Patterns
- 12x reduction in number of trials for rule pairs

Test Suite Compression
[Figure: bipartite graph linking rules r1, r2, r3 to queries Q1, Q2, Q3, with query (node) costs 100, 150, 300 and edge costs 110, 130, 160, 500, 400]
Baseline Cost = 100 + 150 + 300 + 110 + 160 + 400 = 1220
Find a sub-graph of the bipartite graph such that:
- Each rule is selected
- Degree of each rule node is equal to test suite size (k)
- Sum of the edge costs is minimized
Problem is NP-Hard (reduction from Set Cover problem)
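The baseline arithmetic can be checked with a short sketch. The assignment of edge costs to specific (rule, query) pairs is inferred from the totals on the slides and should be read as an assumption; the cost model (each selected query is paid for once, each selected edge once) is likewise an interpretation of the baseline sum.

```python
# Query (node) costs and rule-query (edge) costs. The edge placement is
# inferred from the slide arithmetic, not read off the original figure.
NODE_COST = {"Q1": 100, "Q2": 150, "Q3": 300}
EDGE_COST = {
    ("r1", "Q1"): 110,
    ("r2", "Q1"): 130,
    ("r3", "Q1"): 500,
    ("r2", "Q2"): 160,
    ("r3", "Q3"): 400,
}

def suite_cost(edges):
    """Cost of a test suite: each distinct query once, plus every selected edge."""
    queries = {query for _, query in edges}
    return (sum(NODE_COST[q] for q in queries)
            + sum(EDGE_COST[e] for e in edges))

# Baseline: every rule is tested with its own query (k = 1).
baseline = [("r1", "Q1"), ("r2", "Q2"), ("r3", "Q3")]
baseline_cost = suite_cost(baseline)
print(baseline_cost)
```

Sharing a query between rules saves its node cost, which is what makes a compressed suite cheaper than the baseline.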

Set Cover Heuristic
- Benefit(Q) = Number of new rules exercised / Cost(Q)
- Greedily add the query with the largest “Benefit”
- Add edges corresponding to Q
[Figure: bipartite graph of rules r1, r2, r3 and queries Q1, Q2, Q3]
Benefit(Q1) = 3/100, Benefit(Q2) = 1/150, Benefit(Q3) = 1/200
Total Solution Cost = 100 + 110 + 130 + 500 = 840
- Key drawback: ignores edge costs
- Turning off a rule can significantly increase plan cost
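A minimal sketch of the greedy heuristic for k = 1 follows. The edge placement in the sample graph is inferred from the slide totals and is an assumption; the function names are illustrative.

```python
def greedy_set_cover(node_cost, edge_cost):
    """Pick queries by benefit = newly covered rules / query cost (k = 1).

    Edge costs play no role in the choice; they are only added up afterwards.
    """
    covers = {}
    for rule, query in edge_cost:
        covers.setdefault(query, set()).add(rule)
    uncovered = {rule for rule, _ in edge_cost}
    chosen_edges = []
    while uncovered:
        best = max(covers, key=lambda q: len(covers[q] & uncovered) / node_cost[q])
        newly = covers[best] & uncovered
        chosen_edges += [(rule, best) for rule in sorted(newly)]
        uncovered -= newly
    queries = {query for _, query in chosen_edges}
    total = (sum(node_cost[q] for q in queries)
             + sum(edge_cost[e] for e in chosen_edges))
    return chosen_edges, total

# Sample graph (edge placement assumed).
node_cost = {"Q1": 100, "Q2": 150, "Q3": 300}
edge_cost = {
    ("r1", "Q1"): 110, ("r2", "Q1"): 130, ("r3", "Q1"): 500,
    ("r2", "Q2"): 160, ("r3", "Q3"): 400,
}

edges, total = greedy_set_cover(node_cost, edge_cost)
print(edges, total)
```

On this graph the heuristic selects only Q1 and therefore pays the expensive edge of cost 500 for r3, which illustrates the drawback named on the slide.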

Top K Independent Algorithm
- For each rule r, add k edges with the lowest cost
- Factor 2 approximation of the optimal
- Ignores node cost
[Figure: bipartite graph of rules r1, r2, r3 and queries Q1, Q2, Q3]
Total solution cost = 100 + 150 + 110 + 130 + 160 = 650
- In practice much better than alternatives
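The per-rule selection can be sketched as below. The graph is hypothetical and deliberately different from the figure, so the printed total is not meant to reproduce the slide's 650; the cost accounting (distinct queries once, plus all chosen edges) is an assumed interpretation.

```python
import heapq

def top_k_independent(node_cost, edges_by_rule, k):
    """For each rule independently, keep its k cheapest edges.

    Node costs are ignored while choosing and only added to the final total.
    """
    chosen = []
    for rule, edges in edges_by_rule.items():
        if len(edges) < k:
            raise ValueError(f"rule {rule} has fewer than {k} candidate queries")
        for query, cost in heapq.nsmallest(k, edges.items(), key=lambda qc: qc[1]):
            chosen.append((rule, query, cost))
    queries = {query for _, query, _ in chosen}
    total = (sum(cost for _, _, cost in chosen)
             + sum(node_cost[q] for q in queries))
    return chosen, total

# Hypothetical graph: two rules, three candidate queries each, k = 2.
node_cost = {"Q1": 100, "Q2": 150, "Q3": 300}
edges_by_rule = {
    "r1": {"Q1": 110, "Q2": 130, "Q3": 500},
    "r2": {"Q1": 120, "Q2": 160, "Q3": 400},
}

chosen, total = top_k_independent(node_cost, edges_by_rule, k=2)
print(chosen, total)
```

Because each rule is handled on its own, the selection needs only one pass over the edges of each rule.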

Experiments
- Top K Independent is significantly better
- Even better for the case of rule pairs
- Further optimizations and experiments in the paper

Conclusion
- Testing the query optimizer rule engine is important
- Query generation for rule testing: significant gains by exploiting rule patterns
- Correctness validation: dramatic reductions possible using test suite compression
- Many open problems in rule testing
  - Other variants of “rule exercising”
  - Other kinds of rule interactions
  - Data generation to ensure other necessary conditions (e.g. star join optimization rule requires FK relationship)