Parametric Query Generation Student: Dilys Thomas Mentor: Nico Bruno Manager: Surajit Chaudhuri.



Problem Statement
Given queries with parametric filters, find values of the parameters so that cardinality constraints are satisfied on a given fixed database.
Constraints: cardinality constraints on the query and its subexpressions.
Parameters: simple filters.

Example
Select * from testR where (testR.v1 between %f and %f) : 100,000
Select * from testS where (testS.v1 <= %f) : 17,000
Select * from testR, testS where (testR.v1 = testS.v0) and (testS.v1 <= %f) and (testR.v0 >= %f) and (testR.v1 between %f and %f) : 30,000

Motivation
Generation of queries to test the optimizer. The RAGS tool is presently available to syntactically generate random queries and test for errors by majority vote.

Motivation
Needed to test different modules and new algorithms, to test the statistics estimator, and to compare performance. The queries are not random: you want them to satisfy some constraints.

Does a solution exist? NP-complete.
For n parametric attributes with joins, even when the database has only O(n) tuples.
Reduction from SUBSET SUM, even for a single constraint.

Model
For a given set of parameters, the cardinality can be found by a function invocation. Implemented by:
- actually running the query (slow, accurate)
- using optimizer estimates of the cardinality (fast, inaccurate)
- using an intermediate data structure
Objective: minimize the number of cardinality estimation calls.
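The model above can be sketched with a toy in-memory "database" whose oracle is an exact count; in the real system the same call would be either a query execution (slow, accurate) or an optimizer estimate (fast, inaccurate). The table contents and all names here are illustrative, not from the talk:

```python
# Toy stand-in for testR: 100 x 100 grid of (v0, v1) pairs.
testR = [(v0, v1) for v0 in range(100) for v1 in range(100)]

def cardinality(lo, hi):
    """Oracle for: SELECT * FROM testR WHERE testR.v1 BETWEEN lo AND hi.
    One call = one (expensive) cardinality estimation."""
    return sum(1 for (_, v1) in testR if lo <= v1 <= hi)

print(cardinality(10, 19))  # 10 values of v1, each with 100 values of v0: 1000
```

The search algorithms that follow treat this call as a black box, which is why minimizing the number of invocations matters.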

Understanding the Problem: Simplification
k single-sided <= attribute parameters. Single relation and single constraint.
Let n = number of distinct values in each attribute, k = number of attributes.
Simple algorithm: try all n^k combinations. Can we do better?
In 1 dimension: yes, binary search.
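The 1-dimensional case can be sketched as follows: with a single "attr <= p" parameter, cardinality is monotone in p, so binary search over the n distinct values meets the constraint in O(log n) oracle calls. Toy data and names are illustrative:

```python
values = list(range(1000))                          # distinct attribute values
card = lambda p: sum(1 for v in values if v <= p)   # cardinality oracle

def solve_1d(target):
    """Smallest parameter p with card(p) >= target, via binary search."""
    lo, hi = min(values), max(values)
    while lo < hi:
        mid = (lo + hi) // 2
        if card(mid) < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

p = solve_1d(170)
print(p, card(p))  # 169 170
```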

Results:
Dimension | Upper Bound | Lower Bound
1         | log n       | log n
2         | n           | n
k >= 2    | n^(k-1)     | C(n+k-1, k-1)

2-Dimension Algorithm
Walk based algorithm. Search for 20.

Lower Bound Incomparable set

For general k
Upper bound: for k dimensions, recursively call n invocations of the (k-1)-dimension algorithm.
T(k) = n * T(k-1), T(2) = n. Hence T(k) = n^(k-1) (multiple walk algorithm).
Lower bound: x_1 + x_2 + ... + x_k = n has C(n+k-1, k-1) solutions.

Optimization Problem: Error Metrics
Single constraint: constraint cardinality C, achieved cardinality D.
RelErr = max(C/D, D/C)
Multiple constraints: combine the errors as the average relative error across all constraints.
Objective: minimize the error.
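A minimal sketch of this metric (the cardinalities below are made-up numbers):

```python
def rel_err(target, achieved):
    """Per-constraint relative error: max(C/D, D/C), always >= 1."""
    return max(target / achieved, achieved / target)

def avg_rel_err(targets, achieved):
    """Combine multiple constraints as the average relative error."""
    return sum(rel_err(c, d) for c, d in zip(targets, achieved)) / len(targets)

print(rel_err(100000, 50000))                        # 2.0
print(avg_rel_err([100000, 17000], [50000, 17000]))  # (2.0 + 1.0) / 2 = 1.5
```

Note the symmetry: overshooting and undershooting by the same factor are penalized equally, which keeps the walk from preferring one direction.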

Simple Walk
STEP = unit change in current parameter values.
while (can improve with step)
  { make the improving step }
Step size = 1 tuple -> converges to a local optimum.
Small step size -> slow convergence.

Simple Walk -> Halving Walk
Initialize the parameters (point).
Each step size = 1.0 quantile.
for (int i = 0; i < maxhalve; i++) {
  while (can improve with step)
    { make the improving step }
  // exited above loop -> cannot improve with local steps
  halve all step sizes.
}
Use quantiles to decide steps.

Halving Walk
Initializing the parameters [more later].
Steps are made in the quantile domain of each attribute, via a simple equi-depth wrapper over the histograms provided by SQL Server.
Initial step size = 1.0 quantile.
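A sketch of the halving walk on a single "v1 <= p" parameter, assuming a toy sorted column and an exact counting oracle in place of SQL Server's histograms; the quantile wrapper, maxhalve value, and all names are illustrative simplifications:

```python
data = sorted(range(1000))                                 # toy attribute column
quantile = lambda q: data[min(int(q * (len(data) - 1)), len(data) - 1)]
card = lambda q: sum(1 for v in data if v <= quantile(q))  # cardinality oracle
err = lambda q, t: max(card(q) / t, t / card(q))           # RelErr

def halving_walk(target, q=0.5, step=0.5, maxhalve=20):
    """Hill-climb in quantile space; halve the step whenever stuck."""
    for _ in range(maxhalve):
        improved = True
        while improved:                           # walk at the current step size
            improved = False
            for cand in (q + step, q - step):     # RIGHT move, LEFT move
                if 0.0 <= cand <= 1.0 and err(cand, target) < err(q, target):
                    q, improved = cand, True
                    break
        step /= 2                                 # cannot improve locally: halve
    return q

q = halving_walk(target=170)
print(card(q))  # close to the target cardinality of 170
```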

Halving Walk: Steps Considered
For <= parameters: RIGHT move, LEFT move.
For between parameters: apart from a RIGHT move and a LEFT move for each endpoint, also a LEFT translate and a RIGHT translate.

Algorithm Halving-Steps
A generalization of binary search, but only a heuristic. Converges to a local optimum.
#Steps per iteration: constant. Hence much faster convergence.

Initialization
Random.
Optimizer estimate.
Solving equations: power method; least square error.

Least Squares Initialization
For each parametric attribute P_i, have a variable p_i. For each constraint, build an equation:
Cardinality without parametric filters: C. Constraint cardinality with filters: F.
Then filter selectivity S = F/C.
If P1, P2, ..., Pk are the parameters in this constraint, write the equation p1 * p2 * ... * pk = S (making an independence assumption).

Least Squares Initialization
In log space: a set of linear equations. It may have a single solution, multiple solutions, or no solution!
Use the solution that minimizes the least-squares error metric; in log space this amounts to minimizing the sum (L_2) of the relative errors.
A simple and fast initialization.
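A sketch of this initialization with made-up selectivities: three constraints over two parameters give an overdetermined linear system in log space, solved here via the normal equations (dependency-free, since the system is tiny):

```python
import math

# Three constraints over two parameters (illustrative selectivities):
#   p1 * p2 = 0.10,   p1 = 0.40,   p2 = 0.30
A = [[1, 1], [1, 0], [0, 1]]
b = [math.log(0.10), math.log(0.40), math.log(0.30)]

# Normal equations (A^T A) x = A^T b, solved by Cramer's rule for the 2x2 case.
ata = [[sum(A[r][i] * A[r][j] for r in range(3)) for j in range(2)] for i in range(2)]
atb = [sum(A[r][i] * b[r] for r in range(3)) for i in range(2)]
det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
x1 = (ata[1][1] * atb[0] - ata[0][1] * atb[1]) / det
x2 = (ata[0][0] * atb[1] - ata[1][0] * atb[0]) / det

# Exponentiate back: starting selectivities for the walk.
p1, p2 = math.exp(x1), math.exp(x2)
print(round(p1, 3), round(p2, 3))  # 0.376 0.282
```

No single point satisfies all three made-up constraints exactly; the least-squares solution splits the disagreement, giving p1 * p2 of roughly 0.106 against the target 0.10.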

Why still an INIT step of 1.0 quantile?
Big jumps in the algorithm in spite of a good start point: optimizer estimates and independence assumptions may not be valid in the presence of correlated columns.

Efficiency: Statistics vs Execution
The optimizer is used for cardinality estimation, but the executor is used to verify the final step taken.
For a step where the optimizer (estimates a decrease) and the executor (evaluates an increase) disagree, switch to using only the executor for cardinality estimation.
A good initialization obviates optimizer use.

Shortcutting
Traverse the parameters in random order and make the first step that decreases the error (compare to the previous approach of trying all steps and making the "best" step that decreases the error most).
No significant benefit: shortcutting doesn't seem to help, and in fact sometimes convergence is slower.

Experimental Results
Dataset description: tables testR, testS, tesT, tableTA with up to 1M tuples. They have correlated columns and multiple correlated foreign-key join columns. Columns include different Zipfian(1, 0.5) and Gaussian distributions.
Query description: queries join over correlated columns and have multiple correlated selectivities.

Query Description
Eg 1: 6 correlated parameters, 1 constraint, single relation.
Eg 2: 3 tables with 6 constraints, including 2-way and 3-way join constraints; filters on correlated columns across joins.
Other queries with constraints over joins and many parameters over correlated attributes.

ERROR vs TIME graph

Problem Specifics: Reusing Results
Lots of queries with the same skeleton but different parameters.
Creation of indices will help! Use the DTA for recommendations: fold improvement in speed.

Using the DTA for index creation

Interleaving OPT and Exec
Using the optimizer to guide the search gives a 2-10 times improvement.
Most of this improvement is also obtained by a good initialization procedure.

Shortcutting

Prune Search
Look at only those steps that decrease the error.
If the present query has a larger cardinality than the constraint, only make the filters more selective (and vice versa). % improvement.
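The pruning idea can be sketched as follows for a single "attr <= p" parameter, where decreasing p makes the filter more selective; the helper name and direction convention are illustrative assumptions, not from the talk:

```python
def candidate_steps(achieved, target, step):
    """Keep only the moves that push the cardinality toward the target,
    halving the number of candidate steps the walk must evaluate."""
    if achieved > target:
        return [-step]   # too many rows: tighten the filter
    if achieved < target:
        return [+step]   # too few rows: loosen the filter
    return []            # constraint already met

print(candidate_steps(30000, 17000, 0.05))  # [-0.05]
```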

Pruning Search

Initial Point
Random: may not converge to the global optimum, and convergence is much slower. Especially in the 6-parameter query, it does not converge to the global optimum and gets stuck.
LSE/Power: usually converges to the global optimum, with much faster convergence.

Multiple Start Points
Searches from some start points do not give the global optimum.
In practice, a few start points give the global optimum.

Problem Summary
Create a query for testing a module.
The query is not random but must satisfy some constraints: cardinality constraints, given the freedom to select some parametric filters.

Algorithm: Summary
Theoretical walk-based algorithm. Halving search is good in practice.
Use: good initialization (optimizer/executor mix); pruning; DTA indices.
Cost: that of query executions and optimizer calls.