1 Primitives for Workload Summarization and Implications for SQL Prasanna Ganesan* Stanford University Surajit Chaudhuri Vivek Narasayya Microsoft Research.

Presentation transcript:

1 Primitives for Workload Summarization and Implications for SQL
Prasanna Ganesan* (Stanford University)
Surajit Chaudhuri, Vivek Narasayya (Microsoft Research)
*Work done at Microsoft Research

2 Motivation
- Workload: a set of SQL statements
- Many tasks exploit workload information
  - DB administration, index tuning, statistics building, approximate query processing
- DBMS profilers produce large workloads (plus additional info)
- Most tasks need small workloads
- Goal: summarization, i.e., find a “representative” subset of a given, large workload
  - Sometimes a weighted subset

3 Why Not Random Sampling?
- One size does not fit all
  - Different definitions of “representative subset”
  - Random sampling may lose valuable info
    - Ignores additional info associated with statements
    - Shown to work poorly, e.g., for index selection [chaudhuri02]
  - May oversample queries on some tables, while ignoring less frequent queries on other tables

4 Our Solution
1. Treat input as a relation
   - Each SQL statement (plus associated info) is a tuple
2. Extend SQL with new language primitives
   - Allow declarative specification of the desired subset
   - Usable on arbitrary relations, not just workloads
3. Implement extensions inside the query engine
   - Why? Primitives appear widely applicable
   - Other implementation options available

5 The Architecture
[Figure: architecture diagram with components Application, Execution Engine, and Summary, built around the query below]
    SELECT *, DOMSUM(Count)
    FROM WkldTbl
    DOMINATE WITH PARTITIONING BY FromTables, JoinConds, WhereCols
        (SLAVE.GroupByCols ⊆ MASTER.GroupByCols) AND
        (SLAVE.OrderByCols PREFIX MASTER.OrderByCols)
    REPRESENT WITH PARTITIONING BY FromTables, JoinConds, WhereCols
        MAXIMIZING SUM(DOM_Count)
        GLOBAL CONSTRAINT Count(*) ≤ 200
        LOCAL CONSTRAINT Count(*) ≥ int(200*LOCAL.Count(*)/GLOBAL.Count(*))

6 Outline
- New Primitives for Summarization (Subsetting)
  - Dominance
  - Representation
- Implementing Summarization Primitives in SQL
- Experiments

7 Dominance
- Idea: filter and aggregate using a partial order on tuples
- Specify a condition for one tuple to dominate another
  - Transitive condition
  - Encapsulates application knowledge
- Output: keep throwing away tuples that are dominated
  - Retain aggregate info about dominated tuples

8 A Graphical Representation
[Figure: example tuples with attributes Vendor, Quality, and Price, including vendors Buono and Cattivo]

9 Applying Dominance to Workloads
- Example: index selection
  - An index useful for Q1 is likely to be useful for Q2
- Q1: SELECT ... FROM R GROUP BY A, B, C
- Q2: SELECT ... FROM R GROUP BY A, B
- Q1 dominates Q2
- Dominance condition:
    MASTER.FromTables = SLAVE.FromTables AND
    MASTER.GroupByCols ⊇ SLAVE.GroupByCols AND
    MASTER.OrderByCols PREFIX SLAVE.OrderByCols
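The dominance filter on workloads can be sketched in Python roughly as follows. This is a minimal illustration of the naive quadratic approach, not the implementation from the talk. The dictionary keys (`sql`, `from_tables`, `group_by`, `order_by`, `count`) and the function names are hypothetical, and two readings are assumed: PREFIX is taken to mean that the slave's ORDER BY list is a prefix of the master's, and the aggregate for a surviving statement is its own count plus the counts of every statement it dominates.

```python
def is_prefix(short, long):
    """True if sequence `short` is a prefix of sequence `long`."""
    return len(short) <= len(long) and list(long[:len(short)]) == list(short)


def dominates(master, slave):
    """Assumed reading of the dominance condition for index selection."""
    return (master["from_tables"] == slave["from_tables"]
            and set(slave["group_by"]) <= set(master["group_by"])
            and is_prefix(slave["order_by"], master["order_by"]))


def apply_dominance(workload):
    """Naive O(n^2) filter: drop dominated statements, keep aggregate counts."""
    n = len(workload)

    def beats(i, j):
        # Statement i beats j if it dominates j; mutual dominance
        # (equivalent statements) is broken in favour of the earlier one.
        a, b = workload[i], workload[j]
        return i != j and dominates(a, b) and (not dominates(b, a) or i < j)

    survivors = [j for j in range(n) if not any(beats(i, j) for i in range(n))]
    result = []
    for i in survivors:
        dom_count = workload[i]["count"] + sum(
            workload[j]["count"] for j in range(n) if beats(i, j))
        result.append({**workload[i], "dom_count": dom_count})
    return result


# Hypothetical three-statement workload: Q1 dominates Q2, Q3 is on another table.
example = [
    {"sql": "Q1", "from_tables": ("R",), "group_by": ["A", "B", "C"], "order_by": [], "count": 1},
    {"sql": "Q2", "from_tables": ("R",), "group_by": ["A", "B"], "order_by": [], "count": 3},
    {"sql": "Q3", "from_tables": ("S",), "group_by": ["A"], "order_by": [], "count": 2},
]
summary = apply_dominance(example)
```

In this example Q2 is discarded and its count is folded into Q1, while Q3 survives untouched because no statement on the same table dominates it.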

10 Outline
- New Primitives for Summarization (Subsetting)
  - Dominance
  - Representation
- Implementing Summarization Primitives in SQL
- Experiments

11 Representation
- Dominance only gets us so far
  - Need a “lossier” way to select a subset
- Idea: pick a subset that solves a Linear Program
  - Optimize some criterion
  - Satisfy lots of constraints
  - Support the concept of partitioning

12 Details
- Partition tuples by a set of attributes
- Criterion: maximize/minimize an aggregate
  - E.g., minimize Count(*)
- Global constraints
  - E.g., Sum(B) in chosen subset > 60% of Sum(B) in input
- Local constraints: apply to each partition
  - E.g., Sum(B) in chosen subset > 40% of Sum(B) in that partition
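One way to write the example on this slide as an optimization problem is shown below. The notation is an assumption introduced for illustration: $x_i$ indicates whether tuple $i$ is chosen, $b_i$ is its value of attribute B, and $\mathcal{P}$ is the set of partitions. Because each $x_i$ is restricted to 0 or 1, this is strictly an integer program rather than a plain linear program.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{i} x_i
  && \text{(minimize Count(*))} \\
\text{s.t.}\quad & \sum_{i} b_i x_i > 0.6 \sum_{i} b_i
  && \text{(global constraint)} \\
& \sum_{i \in p} b_i x_i > 0.4 \sum_{i \in p} b_i \quad \forall p \in \mathcal{P}
  && \text{(local constraint, one per partition)} \\
& x_i \in \{0, 1\}
\end{aligned}
```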

13 An Index Selection Example
- Partition by tables, join conditions, and attributes in the WHERE clause
- Criterion: maximize Sum(ExecutionCost)
  - Need best “coverage”
- Global constraint: Count(*) ≤ 200
- Local constraint: proportionate representation
  - A partition with 20% of the input should have 20% of the output
  - Count(*) ≥ int(200*LOCAL.Count(*)/GLOBAL.Count(*))
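For this particular shape of query (maximize a sum, with only Count(*) constraints), a simple greedy procedure suffices. The Python sketch below is an illustration under stated assumptions, not the engine's operator: the function name `represent_max_sum` and its parameters are hypothetical, rows are plain tuples, and costs are assumed to be non-negative so that filling the output up to the global limit never hurts.

```python
from collections import defaultdict


def represent_max_sum(rows, key, value, limit):
    """Choose at most `limit` rows maximizing sum(value), while giving each
    partition at least int(limit * partition_size / total_size) rows."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[key(row)].append(row)

    total = len(rows)
    chosen, leftovers = [], []
    for part in partitions.values():
        part.sort(key=value, reverse=True)           # most expensive first
        quota = min(len(part), int(limit * len(part) / total))
        chosen.extend(part[:quota])                  # satisfy the local constraint
        leftovers.extend(part[quota:])

    # Spend whatever remains of the global budget on the costliest leftovers.
    leftovers.sort(key=value, reverse=True)
    chosen.extend(leftovers[:max(0, limit - len(chosen))])
    return chosen


# Hypothetical workload rows: (partition key, execution cost).
rows = [("T1", 10), ("T1", 9), ("T1", 8), ("T1", 7), ("T2", 1), ("T2", 2)]
picked = represent_max_sum(rows, key=lambda r: r[0], value=lambda r: r[1], limit=3)
```

With a limit of 3, partition T1 (four of six rows) is guaranteed two slots and T2 one slot, so the cheaper T2 statement is kept even though a third T1 statement would have a higher cost.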

14 Putting it all together
1. Apply dominance criterion (as earlier).
2. Apply representation (as earlier, but maximize SUM(DOM_Count)).
3. Weight each tuple by the number of tuples it dominates.
    SELECT SqlString, DOMSUM(Count)
    FROM WkldTbl
    DOMINATE WITH PARTITIONING BY FromTables, JoinConds, WhereCols
        (SLAVE.GroupByCols ⊆ MASTER.GroupByCols) AND
        (SLAVE.OrderByCols PREFIX MASTER.OrderByCols)
    REPRESENT WITH PARTITIONING BY FromTables, JoinConds, WhereCols
        MAXIMIZING SUM(DOM_Count)
        GLOBAL CONSTRAINT Count(*) ≤ 200
        LOCAL CONSTRAINT Count(*) ≥ int(200*LOCAL.Count(*)/GLOBAL.Count(*))

15 Outline
- New Primitives for Summarization (Subsetting)
  - Dominance
  - Representation
- Implementing Summarization Primitives in SQL
- Experiments

16 Implementing Summarization Primitives in SQL
- Assume set and sequence support in SQL
  - The mills of the standards bodies…
- Partitioning is useful for both primitives
  - Hashing, sort-based, index-based…
- Implementing Dominance
  - Naïve O(n^2) algorithm
  - Techniques from group-wise processing
  - Leverage Skyline optimizations

17 Representation
- Implementing directly is NP-hard
- Many queries are much simpler
  - Fall into one of two special cases
- Other queries are handled by a simple heuristic
  - User-guided search
- Implement as multiple operators

18 User-Guided Search
- Scan tuples in a specific order
  - User-specified, or heuristically chosen
- Will always minimize/maximize Count(*)
  - Use ordering to transform other objectives
  - Slightly different algorithms for the two cases

19 A Minimization Example
[Figure: tuples A through F scanned against constraints marked Satisfied or Violated, with the selected tuples shown as Output]

20 Two Special Cases
- Maximize SUM(Attr)
  - All constraints are on Count(*)
  - Use partitioning and sort-order access
- Minimize Count(*)
  - Single constraint: again easily solved
  - More special cases also solvable
  - Multiple constraints: approximation algorithm
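The single-constraint case of minimizing Count(*) can be illustrated with a short Python sketch. It assumes a constraint of the form "Sum(B) in the chosen subset > some fraction of Sum(B) in the input" with non-negative values of B, for which scanning in descending order of B and stopping as soon as the constraint holds gives the smallest subset. The function name `min_count_cover` and its parameters are hypothetical.

```python
def min_count_cover(rows, value, fraction):
    """Smallest subset whose sum(value) strictly exceeds `fraction` of the
    input total; returns None if even the full input cannot satisfy it."""
    target = fraction * sum(value(r) for r in rows)
    chosen, running = [], 0
    for row in sorted(rows, key=value, reverse=True):   # largest B first
        if running > target:
            break                                       # constraint satisfied
        chosen.append(row)
        running += value(row)
    return chosen if running > target else None


# Hypothetical values of attribute B for five tuples.
b_values = [5, 50, 10, 30, 5]
subset = min_count_cover(b_values, value=lambda b: b, fraction=0.6)
```

Here the total is 100, so the two largest values (50 and 30) already exceed the 60% threshold and the remaining three tuples are left out.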

21 Experiments
- Evaluate utility for index selection
- Compare to sophisticated Workload Compression [chaudhuri02]
  - Clusters using a complex distance function
- Simple query as described earlier
  - Constrained to output the same number of statements as Workload Compression
  - Orders of magnitude faster
- TPC-H 1GB database
  - Multiple synthetic workloads introduced in [chaudhuri02]

22 Experiments (Contd.)
[Figure: experimental pipeline with stages Workload, Compress, Tuning Wizard, and Evaluate Total Estimated Cost]

23 Comparing Estimated Costs

24 Conclusion
- Our contributions
  - Summarization can be expressed declaratively
  - Introduction of new operators for summarization
  - Discussion of SQL implementation
- The future
  - An automatic monitoring and tuning infrastructure?
  - More workload-sensitive tasks?