Using the Optimizer to Generate an Effective Regression Suite: A First Step Murali M. Krishna Presented by Harumi Kuno HP.

Presentation transcript:

Using the Optimizer to Generate an Effective Regression Suite: A First Step Murali M. Krishna Presented by Harumi Kuno HP

Space of possible queries, schemas, data is practically infinite
- Innumerable plans are possible for any query in this space.
[Figure labels: possible schemas, possible queries, possible data]

Problem: how to construct an effective and economical query optimizer / executor regression suite?
- If a new plan is now produced for a given query, is its performance worse than before? (Optimizer regression)
- If the same plan is produced, is its performance worse? (Executor regression)
[Figure labels: possible designs, possible queries, possible data]
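The two questions on this slide can be read as a small decision rule. The sketch below is illustrative only: the `Run` record, the plan-as-string representation, and the 10% tolerance are assumptions made here, not details from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    plan: str       # hypothetical canonical text form of the chosen plan
    seconds: float  # measured elapsed time for the query


def classify(before: Run, after: Run, tolerance: float = 0.10) -> str:
    """Label a before/after pair using the two questions on the slide."""
    slower = after.seconds > before.seconds * (1.0 + tolerance)
    if not slower:
        return "no regression"
    if after.plan != before.plan:
        # A new plan was produced and it performs worse than before.
        return "optimizer regression"
    # Same plan, worse performance.
    return "executor regression"


print(classify(Run("J2-J2-J5", 1.0), Run("J4-J2-J5", 2.0)))
print(classify(Run("J2-J2-J5", 1.0), Run("J2-J2-J5", 2.0)))
```

The tolerance exists only to absorb measurement noise; a real harness would choose it from repeated runs.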

Goal: Building the regression test suite should…
- … cover the plan space (code coverage is not enough)
- … be economical not only for the optimizer but also for the execution engine
- … require little knowledge of the optimizer (e.g., without hiring an expensive expert)

Historically, test suite focus has been on known workloads
- Customer workloads, benchmarks, known problems
- Developers and QA add “point” test cases
  - Test specific optimization rules
  - Improve code coverage
  - May be very complex (large search space)
[Figure labels: possible designs, possible queries, possible data]

Disadvantages of historical approach
- Can’t quantify or evaluate the benefit added by new tests
- When a regression occurs (especially if the query is complex), it is hard to find the exact cause; a human expert is needed to find and fix the problem
- Time to run increases with added tests
- QA groups use code coverage as a metric, but this does not guarantee plan coverage

Insight
- Enumerate the Optimizer Plan Space (OPS) over physical operators (PO)
- Target OPS (PO, n): all possible legal plans with n operators chosen from PO
  - PO: subset of the physical operators used by the DBMS that we want to test
  - n will have to be quite small (≤ 10)
- Goal: try to cover this finite space with as few queries as possible, a non-trivial problem!

Difficult design decisions
- How many table instances in each query?
  - 4, 10, 20? Plan space grows exponentially
  - A 4-table query may be good enough for our purposes (regression testing the cost model)
  - A 20-table query plan is composed of a series of 4-table queries
- How many physical operators per query?
  - 3 successive hash joins behave very differently from 3 NL joins at execution time
  - Regression suite should capture execution dependencies (e.g., intra-query resource contention)
  - We settle for a small number of operators in each of our regression queries

Initial Experiment
- Focus on join space only
- Start with a simple skeleton query
    select T1.a
    from T1, T2, T3, T4
    where T1.a = T2.a & T2.c = T3.c & T3.d = T4.d &
          T1.b ≤ C1 & T2.b ≤ C2 & T3.b ≤ C3 & T4.b ≤ C4
    1 ≤ Ci ≤ |T|, 1 ≤ i ≤ 4
- Column b is unique and exactly controls the number of rows from each instance
- All queries in the regression suite will have the same structure
[Figure: join graph over T1, T2, T3, T4]

Assumption
- 5 different join methods:
  - J1: cartesian product
  - J2: regular hash join (one or both inputs may have to be partitioned on the fly)
  - J3: small-table broadcast hash join (the smaller input is broadcast to all the sites/CPUs of the bigger table)
  - J4: merge join (one or both inputs may need to be sorted)
  - J5: index nested loops join

Enumerating Join Plan Space
- # of non-bushy join plans = (4!) * 5^3 = 3000 = subset of OPS ({J1, J2, J3, J4, J5}, 3)
- Insight: we can make do with just one table and use 4 instances of the same table
- What is important is the number of rows from each table participating in each query (from the costing perspective), not the order of the tables
- Therefore the number of join plans of interest is 3,000 / (4!) = 125
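The counting argument on this slide can be checked by brute force. The sketch below assumes that "non-bushy" means a left-deep plan, i.e. one table ordering plus one join method for each of the three joins; the names `TABLES` and `JOIN_METHODS` are labels introduced here for illustration.

```python
from itertools import permutations, product
from math import factorial

JOIN_METHODS = ["J1", "J2", "J3", "J4", "J5"]
TABLES = ["T1", "T2", "T3", "T4"]


def left_deep_plans(tables, methods):
    """Yield (table order, join-method sequence) for every left-deep plan."""
    for order in permutations(tables):
        for joins in product(methods, repeat=len(tables) - 1):
            yield order, joins


all_plans = list(left_deep_plans(TABLES, JOIN_METHODS))

# Ignoring table order (all four instances are the same table T),
# only the sequence of join methods distinguishes one plan from another.
join_sequences = {joins for _, joins in all_plans}

print(len(all_plans))                          # 4! * 5^3
print(len(join_sequences))                     # 5^3
print(len(all_plans) // factorial(len(TABLES)))
```

Collapsing the table order is what reduces the target from 3000 plans to the 125 join-method sequences the suite tries to cover.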

Structure of Table T (details in paper)
- After some trial and error, we picked T with the following schema
- T (int a, int b, int c, int d, [e, f, g])
  a: primary clustering key, hash partitioned 8 ways on column a
  b: unique random secondary key
  c: Beta (4, 4): integer-valued, Normal-like over [ ] with mean = , std. dev =
  d: Beta (2, 0.5): integer-valued, Zipfian-like over [ ] with mean = , skew =
  e: uniformly distributed between 1 and 256
  f: uniformly distributed between 1 and 4096
  g: uniformly distributed between 1 and
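A table with this shape can be approximated with Python's standard library. This is a rough sketch under several assumptions made here: column b is taken to be a random permutation of 1..|T| (which is what makes a predicate `b <= C` return exactly C rows), the integer domain of c and d is a hypothetical `domain_hi` parameter because the real ranges are not given above, and column g is left out because its upper bound is not given either.

```python
import random


def beta_int(rng, alpha, beta, lo, hi):
    """Scale a Beta(alpha, beta) sample from [0, 1] to an integer in [lo, hi]."""
    x = rng.betavariate(alpha, beta)
    return min(hi, lo + int(x * (hi - lo + 1)))


def make_rows(n_rows, domain_hi, seed=0):
    """Generate n_rows rows shaped like table T (columns a..f only)."""
    rng = random.Random(seed)
    b_values = list(range(1, n_rows + 1))
    rng.shuffle(b_values)  # b: unique, in random order (assumed permutation)
    rows = []
    for a in range(1, n_rows + 1):
        rows.append({
            "a": a,                                     # primary key
            "b": b_values[a - 1],                       # unique secondary key
            "c": beta_int(rng, 4, 4, 1, domain_hi),     # symmetric, bell-shaped
            "d": beta_int(rng, 2, 0.5, 1, domain_hi),   # heavily skewed
            "e": rng.randint(1, 256),
            "f": rng.randint(1, 4096),
        })
    return rows


rows = make_rows(n_rows=1000, domain_hi=100)
print(sum(1 for r in rows if r["b"] <= 10))  # rows selected by b <= 10
```

Because b is a permutation, the constant in `b <= C` is also the exact cardinality of that table instance, which is the property the skeleton query relies on.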

Generating the Regression Queries
- Ultimately, we want 125 queries, each with a distinct join plan
- We can try to generate them manually: very tedious, and it will require immense knowledge of the optimizer
- Skeleton query lets us generate many queries with diverse plans ‘automatically’ by varying constants in predicates; no knowledge of the optimizer required. No pain and lots of gain

Generating the Regression Queries (2)
- Key idea: vary the constants C1 through C4 to ‘cover’ the 4-dimensional cardinality space
- By varying the cardinalities of the participating tables we are essentially covering the cardinality space, which in turn will cover the plan space by letting the optimizer choose among the various join implementations
- We chose each Ci from these 10 values, resulting in 10^4 queries:
    {1, 10, 100, 1K, 10K, 100K, 1M, 2M, 4M, 8M}
- These queries were prepared and only 42 (< 35%) distinct plans were found
- Linear skeleton query not good enough! Only one cartesian product possible; no star joins
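Varying the four constants is a plain Cartesian product over the ten chosen values. The sketch below writes the skeleton as ordinary SQL text with `and` and `<=`, and uses four instances of one table T as the earlier insight suggests. The `get_plan` argument is a hypothetical callable standing in for whatever the DBMS offers to compile a query and return its plan; no real API is implied.

```python
from itertools import product

CARDINALITIES = [1, 10, 100, 1_000, 10_000, 100_000,
                 1_000_000, 2_000_000, 4_000_000, 8_000_000]

LINEAR_SKELETON = (
    "select T1.a from T T1, T T2, T T3, T T4 "
    "where T1.a = T2.a and T2.c = T3.c and T3.d = T4.d "
    "and T1.b <= {0} and T2.b <= {1} and T3.b <= {2} and T4.b <= {3}"
)


def linear_queries():
    """Yield (constants, SQL text) for every point in the cardinality grid."""
    for constants in product(CARDINALITIES, repeat=4):
        yield constants, LINEAR_SKELETON.format(*constants)


def distinct_plans(queries, get_plan):
    """Group grid points by the plan the optimizer picks for them."""
    plans = {}
    for constants, sql in queries:
        plans.setdefault(get_plan(sql), []).append(constants)
    return plans


queries = list(linear_queries())
print(len(queries))   # 10^4 grid points
print(queries[0][1])  # the query with C1 = C2 = C3 = C4 = 1
```

With a real `get_plan`, `len(distinct_plans(queries, get_plan))` would be the number the slide reports as 42.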

Generalized Skeleton Query
    select T1.a
    from T T1, T T2, T T3, T T4
    where T1.a = T2.a & T2.c = T3.c & T3.d = T4.d &
          T1.e = T3.e & T1.f = T4.f & T2.g = T4.g &
          T1.b ≤ C1 & T2.b ≤ C2 & T3.b ≤ C3 & T4.b ≤ C4
- 6 = (4 choose 2) join predicates
- 2^6 subsets of join predicates possible to allow for all join geometries (includes star joins, 0-3 CPs, etc.)
- Number of queries generated = 64 * 10K = 640,000
- When these queries were compiled, # of distinct plans found was 101 (> 80%)
- Picking the right skeleton query is crucial
[Figure: join graph over T1, T2, T3, T4]
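The 640,000 figure follows from pairing every subset of the six join predicates with every point of the cardinality grid. A minimal sketch of that generator, with SQL written using `and` and `<=`; the function names are introduced here for illustration.

```python
from itertools import combinations, product

CARDINALITIES = [1, 10, 100, 1_000, 10_000, 100_000,
                 1_000_000, 2_000_000, 4_000_000, 8_000_000]

JOIN_PREDICATES = [
    "T1.a = T2.a", "T2.c = T3.c", "T3.d = T4.d",
    "T1.e = T3.e", "T1.f = T4.f", "T2.g = T4.g",
]


def predicate_subsets(preds):
    """Yield every subset of the join predicates, including the empty one."""
    for k in range(len(preds) + 1):
        yield from combinations(preds, k)


def build_query(join_preds, constants):
    """Combine a subset of join predicates with the four b <= Ci filters."""
    filters = [f"T{i}.b <= {c}" for i, c in enumerate(constants, start=1)]
    where = " and ".join(list(join_preds) + filters)
    return f"select T1.a from T T1, T T2, T T3, T T4 where {where}"


def generalized_queries(cards):
    """Lazily yield all generalized skeleton queries."""
    for join_preds in predicate_subsets(JOIN_PREDICATES):
        for constants in product(cards, repeat=4):
            yield build_query(join_preds, constants)


n_subsets = len(list(predicate_subsets(JOIN_PREDICATES)))
print(n_subsets)                            # 2^6
print(n_subsets * len(CARDINALITIES) ** 4)  # total queries
print(next(generalized_queries(CARDINALITIES)))
```

The empty subset yields a query with three cartesian products, and subsets such as the three predicates touching T4 give star-shaped joins, which is how this skeleton reaches geometries the linear one cannot.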

Generating the Regression Suite
- Pick a small number of queries corresponding to each distinct plan
- How to pick these queries may be a good research topic
- We picked 2 queries for each plan (‘closest’ and ‘farthest’ from the origin)
- Intuition: these represent the two extremes for the plan
- For a good starting point, verify cardinality estimates and optimality of plans in as many cases as possible. This is feasible because the queries are simple.
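Picking the ‘closest’ and ‘farthest’ query for each plan needs a distance in the cardinality space, which the slide does not define. The sketch below assumes Euclidean distance on log10 of the constants, so that the point (1, 1, 1, 1) sits at the origin; that metric, and the dictionary shape used as input, are assumptions made here.

```python
from collections import defaultdict
from math import log10, sqrt


def distance_from_origin(constants):
    """Assumed metric: Euclidean norm of the log10 cardinalities."""
    return sqrt(sum(log10(c) ** 2 for c in constants))


def pick_extremes(plan_by_constants):
    """Map each distinct plan to its (closest, farthest) grid points."""
    groups = defaultdict(list)
    for constants, plan in plan_by_constants.items():
        groups[plan].append(constants)
    suite = {}
    for plan, points in groups.items():
        closest = min(points, key=distance_from_origin)
        farthest = max(points, key=distance_from_origin)
        suite[plan] = (closest, farthest)
    return suite


# Hypothetical compile results: grid point -> plan label.
observed = {
    (1, 1, 1, 1): "J5-J5-J5",
    (10, 10, 10, 10): "J5-J5-J5",
    (1000, 1, 1, 1): "J5-J5-J5",
    (8_000_000, 8_000_000, 1, 1): "J2-J3-J5",
}
for plan, (near, far) in pick_extremes(observed).items():
    print(plan, near, far)
```

A plan seen at only one grid point gets the same query as both extremes, so the suite holds at most two queries per plan.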

Comparison with TPC-H/DS
- Compiled queries with at least 4 tables in TPC-H and TPC-DS (bushy plans turned off)
- Found all sequences of 3 consecutive joins in non-bushy plans
- Note: J1-J2-J3-J1-J4-J5 yields 4 sets of 3 joins
- Even so, found 67 distinct plans (it gets incrementally harder to generate additional plans manually)
- Our suite was 50% better (101 distinct plans versus 67)
- Key point: our queries are very SIMPLE (easy to diagnose and fix problems) while DS queries are very COMPLEX
- Our suite can be generated easily as well
- Not suggesting that our suite replace H/DS
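Extracting every run of 3 consecutive joins from a longer plan is a sliding window. The sketch below shows the window count for the slide's example and how distinct triples would be accumulated across a set of plans; the second plan in `benchmark_plans` is made up for illustration.

```python
def join_triples(join_sequence, width=3):
    """Return every run of `width` consecutive joins in a non-bushy plan."""
    return [
        tuple(join_sequence[i:i + width])
        for i in range(len(join_sequence) - width + 1)
    ]


def distinct_triples(plans, width=3):
    """Collect the distinct join runs seen across a set of plans."""
    seen = set()
    for plan in plans:
        seen.update(join_triples(plan, width))
    return seen


example = ["J1", "J2", "J3", "J1", "J4", "J5"]
print(join_triples(example))
print(len(join_triples(example)))

# Hypothetical second plan, only to show that repeated triples are counted once.
benchmark_plans = [example, ["J1", "J2", "J3", "J2"]]
print(len(distinct_triples(benchmark_plans)))
```

Counting this way gives the benchmark queries credit for every 3-join window inside their larger plans, which is the generous reading under which they still reached only 67 of the 125 sequences.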

Conclusions
- Use enumeration to generate a regression suite that covers the plan space automatically
- No knowledge of optimizer internals required
- BUT need to come up with a good skeleton query
- Validated approach by applying it to join plan space.

Future Work
- How to extend to other SQL operators?
- How to extend to a larger number of tables (number of generated queries is ‘doubly exponential’)? If T = 5, generated queries = million
- Will sampling work to reduce the # of queries?
- How to come up with a better metric that takes into account the quality of plans in the suite?
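The growth concern can be made concrete by extrapolating the 4-table construction: one optional join predicate per pair of tables and one constant per table. The formula below is that extrapolation under the assumption that each constant still takes 10 values; it is not a figure taken from the slides.

```python
from math import comb


def generated_queries(n_tables, values_per_constant=10):
    """Count queries: 2^(n choose 2) predicate subsets times the constant grid."""
    join_predicates = comb(n_tables, 2)
    return 2 ** join_predicates * values_per_constant ** n_tables


for n in (4, 5, 6):
    print(n, generated_queries(n))
```

The exponent `comb(n, 2)` grows quadratically in the number of tables, so the number of predicate subsets quickly dominates the cardinality grid; this is why sampling the space is raised as an open question.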