Automatic Physical Design Tuning: Workload as a Sequence

Presentation transcript:

Automatic Physical Design Tuning: Workload as a Sequence
Sanjay Agrawal (Microsoft Research), Eric Chu (University of Wisconsin-Madison), Vivek Narasayya (Microsoft Research)

Automatic Physical Design Tuning
- DB applications are becoming more complex and varied, and considerable time is spent on tuning.
- Goal: reduce the cost of ownership of an RDBMS by automatically recommending a physical design.
- Supported by DB vendors: Database Engine Tuning Advisor (Microsoft), Design Advisor (IBM), SQL Access Advisor (Oracle).
11/21/2018 SIGMOD 2006

Microsoft Database Engine Tuning Advisor
[Architecture diagram] Applications produce a workload (a set of queries and updates). The Database Engine Tuning Advisor takes the workload, issues "what-if" calls to the extended query optimizer of Microsoft SQL Server 2005, and outputs a recommendation: a set of indexes, materialized views, and horizontal partitions.

Workload as a Sequence: Motivation
- Data warehousing example: queries run by day, updates run at night.
- Set: no index is recommended when update costs outweigh query benefits.
- Sequence: may exploit the benefits of indexes without incurring update costs, by inserting "create" and "drop" index statements into the workload and exploiting the order of statements.
- [Timeline figure] Create Indexes, Queries (day), Drop Indexes, Updates (night).

Set vs. Sequence
- Set-based: the recommendation is robust to changes in the order of statement arrival, but can miss good recommendations compared to the sequence-based approach.
- Outputs are different. Set: what indexes to create or drop? Sequence: what indexes to create or drop, and where?
- [Timeline figure] Create Indexes, Queries, Drop Indexes, Updates, Queries.

Outline: Model Workload as a Sequence; Motivation; Problem Definition; Optimal Algorithm; Disjoint Sequences; Greedy-SEQ; Experiments.

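Problem Setting
- Workload: $S = [S_1, S_2, \ldots, S_N]$, where each $S_i \in \{\text{Select}, \text{Insert}, \text{Delete}, \text{Update}\}$.
- [Figure] Configurations $C_0, C_1, \ldots, C_N, C_{N+1}$, with statement $S_i$ executed under configuration $C_i$.
- $\mathrm{Cost}(S_i, C_i)$: cost of executing $S_i$ with $C_i$.
- $\mathrm{TC}(C_1, C_2)$: transition cost from configuration $C_1$ to configuration $C_2$.
- Sequence execution cost:
$$\sum_{k=1}^{N} \bigl( \mathrm{Cost}(S_k, C_k) + \mathrm{TC}(C_{k-1}, C_k) \bigr) + \mathrm{TC}(C_N, C_{N+1})$$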

Problem Definition
- Given: database $D$, workload $W = [S_1, \ldots, S_N]$, initial configuration $C_0$, and storage bound $M$.
- Find configurations $C_1, C_2, \ldots, C_{N+1}$ that minimize the sequence execution cost
$$\sum_{k=1}^{N} \bigl( \mathrm{Cost}(S_k, C_k) + \mathrm{TC}(C_{k-1}, C_k) \bigr) + \mathrm{TC}(C_N, C_{N+1})$$
  subject to $\mathrm{Storage}(C_i) \le M$ for all $i$.
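
The objective in the problem definition can be evaluated directly once a list of configurations has been chosen. The following is a minimal sketch rather than the authors' implementation: the function names, the toy `toy_cost` and `toy_tc` models, and the single index "I" are all hypothetical, chosen to mirror the query-by-day, update-at-night motivation.

```python
from typing import Callable, Dict, FrozenSet, Sequence

Config = FrozenSet[str]  # a configuration is modeled as a set of index names


def sequence_execution_cost(
    statements: Sequence[str],
    configs: Sequence[Config],
    cost: Callable[[str, Config], float],
    tc: Callable[[Config, Config], float],
) -> float:
    """Sum of Cost(S_k, C_k) + TC(C_{k-1}, C_k) for k = 1..N, plus TC(C_N, C_{N+1}).

    `configs` must hold C_0, C_1, ..., C_N, C_{N+1}, i.e. N + 2 entries.
    """
    n = len(statements)
    if len(configs) != n + 2:
        raise ValueError("expected N + 2 configurations: C_0 .. C_{N+1}")
    total = 0.0
    for k in range(1, n + 1):
        total += cost(statements[k - 1], configs[k]) + tc(configs[k - 1], configs[k])
    return total + tc(configs[n], configs[n + 1])


def within_storage(configs: Sequence[Config], size: Dict[str, float], bound: float) -> bool:
    """Check the constraint Storage(C_i) <= M for every configuration."""
    return all(sum(size[i] for i in c) <= bound for c in configs)


# Hypothetical toy model: index "I" speeds up queries but slows down updates.
def toy_cost(stmt: str, config: Config) -> float:
    if stmt == "query":
        return 1.0 if "I" in config else 10.0
    return 5.0 if "I" in config else 2.0  # updates pay index maintenance


def toy_tc(old: Config, new: Config) -> float:
    # Assumed costs: 3 per index created, 1 per index dropped.
    return 3.0 * len(new - old) + 1.0 * len(old - new)


workload = ["query", "query", "update"]
NONE, IDX = frozenset(), frozenset({"I"})
static_no_index = [NONE] * 5                      # never build the index
static_index = [NONE] + [IDX] * 4                 # build once, keep forever
create_then_drop = [NONE, IDX, IDX, NONE, NONE]   # drop before the update

for plan in (static_no_index, static_index, create_then_drop):
    print(sequence_execution_cost(workload, plan, toy_cost, toy_tc))
```

Under these assumed numbers, creating the index for the queries and dropping it before the update is cheaper than either static choice, which is the effect the sequence model is meant to capture.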

Search Space
- Given $N$ statements and $M$ indexes.
- Sequence-based tuning: $2^M$ distinct configurations for each statement, and $2^{M(N+1)}$ possible execution sequences.
- Set-based tuning: $2^M$ configurations.
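
To give a feel for how quickly these counts grow, here is a small arithmetic sketch. The function names are hypothetical, and M here is the number of candidate indexes (as on this slide), not the storage bound.

```python
def configurations_per_statement(num_indexes: int) -> int:
    """Each index is either present or absent: 2^M configurations."""
    return 2 ** num_indexes


def sequence_search_space(num_statements: int, num_indexes: int) -> int:
    """One configuration for each of C_1 .. C_{N+1}: 2^(M * (N + 1)) sequences."""
    return 2 ** (num_indexes * (num_statements + 1))


# Example: M = 3 indexes and N = 4 statements.
print(configurations_per_statement(3))   # 8
print(sequence_search_space(4, 3))       # 32768
```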

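Optimal Algorithm for Single-Index Case
- [Figure] DAG for a single index I and N statements: between a SOURCE node and a DESTINATION node, each stage $S_1, S_2, \ldots, S_N$ has two nodes, { } and {I}, connected by edges labeled with the create cost $I_c$ or the drop cost $I_d$.
- Node costs: $\mathrm{Cost}(S_i, \{\})$ and $\mathrm{Cost}(S_i, \{I\})$.
- Edge costs: 0, $I_c$, and $I_d$.
- The cost of the shortest path includes node and edge costs.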
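
Because the single-index DAG has only two nodes per stage, its shortest path can be found with a two-state dynamic program. The sketch below is illustrative only: the function name, the input layout, and the choice to make the start and end states parameters (defaulting to "no index") are assumptions, not details taken from the slides.

```python
from typing import List, Sequence, Tuple


def single_index_plan(
    stmt_costs: Sequence[Tuple[float, float]],
    create_cost: float,
    drop_cost: float,
    start_has_index: bool = False,
    end_has_index: bool = False,
) -> Tuple[float, List[bool]]:
    """Shortest path over the two-node-per-stage DAG.

    stmt_costs[k] = (Cost(S_k, {}), Cost(S_k, {I})).
    Returns (total cost, whether the index is present for each statement).
    """
    inf = float("inf")

    def edge(a: int, b: int) -> float:
        # 0 to stay in the same state, create cost for {} -> {I}, drop cost for {I} -> {}.
        if a == b:
            return 0.0
        return create_cost if b == 1 else drop_cost

    best = [inf, inf]                       # best[state]: 0 = {}, 1 = {I}
    best[1 if start_has_index else 0] = 0.0
    back: List[List[int]] = []              # back[k][state] = predecessor state
    for node_cost in stmt_costs:
        new_best, pred = [], []
        for b in (0, 1):
            total, prev = min((best[a] + edge(a, b) + node_cost[b], a) for a in (0, 1))
            new_best.append(total)
            pred.append(prev)
        best = new_best
        back.append(pred)

    end = 1 if end_has_index else 0
    total, state = min((best[a] + edge(a, end), a) for a in (0, 1))
    if not back:
        return total, []
    states = [state]                        # walk predecessors from the last stage
    for pred in reversed(back[1:]):
        states.append(pred[states[-1]])
    states.reverse()
    return total, [bool(s) for s in states]


# Two queries that benefit from the index, then an update that does not.
print(single_index_plan([(10, 1), (10, 1), (2, 5)], create_cost=3, drop_cost=1))
```

The running time is linear in the number of statements, since each stage examines only four edges.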

General Case: Multiple Indexes (EXHAUSTIVE)
- [Figure] DAG from $C_0$ to $C_{N+1}$ with one stage per statement; stage $k$ holds every configuration for statement $S_k$.
- At each stage, enumerate all possible configurations from the set of indexes.
- The algorithm is linear in the number of nodes and edges of the DAG.
- However, the number of nodes in the DAG is exponential in the number of indexes: $M$ indexes give $O(N \cdot 2^M)$ nodes and $O(N \cdot 2^M)$ edges.
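
The exhaustive approach can be sketched as a stage-by-stage dynamic program over every subset of indexes that fits the storage bound. This is an illustration under assumed inputs, not the authors' EXHAUSTIVE code: `feasible_configs`, `exhaustive`, the toy cost model, and the index sizes are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Config = FrozenSet[str]
Stmt = Tuple[str, str]  # (kind, name of the index that helps a query, or "")


def feasible_configs(indexes: Sequence[str], size: Dict[str, float], bound: float) -> List[Config]:
    """Every subset of the candidate indexes whose total size fits the bound."""
    configs = []
    for r in range(len(indexes) + 1):
        for combo in combinations(indexes, r):
            if sum(size[i] for i in combo) <= bound:
                configs.append(frozenset(combo))
    return configs


def exhaustive(
    statements: Sequence[Stmt],
    indexes: Sequence[str],
    cost: Callable[[Stmt, Config], float],
    tc: Callable[[Config, Config], float],
    size: Dict[str, float],
    bound: float,
    c0: Config = frozenset(),
) -> Tuple[float, List[Config]]:
    """Return (minimum sequence execution cost, [C_1, ..., C_N, C_{N+1}])."""
    configs = feasible_configs(indexes, size, bound)
    best: Dict[Config, Tuple[float, List[Config]]] = {c0: (0.0, [])}
    for stmt in statements:
        nxt = {}
        for c in configs:  # cheapest way to execute stmt under configuration c
            nxt[c] = min(
                ((pc + tc(p, c) + cost(stmt, c), path + [c]) for p, (pc, path) in best.items()),
                key=lambda t: t[0],
            )
        best = nxt
    # Final transition TC(C_N, C_{N+1}) into the configuration left in place.
    return min(
        ((pc + tc(p, c), path + [c]) for p, (pc, path) in best.items() for c in configs),
        key=lambda t: t[0],
    )


# Hypothetical toy model: a query is cheap only if its helpful index exists;
# an update pays maintenance for every index present.
def toy_cost(stmt: Stmt, config: Config) -> float:
    kind, helpful = stmt
    if kind == "query":
        return 1.0 if helpful in config else 10.0
    return 2.0 + 3.0 * len(config)


def toy_tc(old: Config, new: Config) -> float:
    return 3.0 * len(new - old) + 1.0 * len(old - new)


workload = [("query", "I1"), ("query", "I2"), ("update", "")]
sizes = {"I1": 1.0, "I2": 1.0}
print(exhaustive(workload, ["I1", "I2"], toy_cost, toy_tc, sizes, bound=1.0))
```

With the storage bound set to one index, the cheapest plan in this toy example swaps I1 for I2 between the two queries and drops everything before the update. The number of configurations enumerated per stage still doubles with every additional candidate index.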

Optimal Solution
[Flow diagram] Inputs: the sequence with its constraints, and the candidate set of structures. Step: solve the sequence using EXHAUSTIVE. Output: recommendation.

Search-Space Pruning
Techniques to reduce the number of nodes:
- Cost-based pruning: leverages shortest-path solutions of individual indexes, and prunes configurations at each stage without loss of optimality.
- Disjoint sequences: a divide-and-conquer approach that splits the input sequence and the candidate index set.
- Greedy-SEQ: guarantees a polynomial number of nodes.

Exploiting Disjoint Sequences
- Two sequences X and Y are disjoint if they share neither statements nor indexes.
- Disjoint sequences are common, e.g., when a server hosts multiple applications that touch different databases.
- Approach: split the workload into disjoint sequences, solve each sequence independently, and merge to get the final solution.
- Idea: the DAG for each disjoint sequence has fewer nodes.
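
One way to picture the split step is as finding connected components: statements that share a candidate index, directly or transitively, end up in the same sequence. The sketch below uses a small union-find structure for this. It is only an illustration; the slides do not say how the Split operator is implemented, and the input layout (each statement paired with the set of candidate indexes relevant to it) is an assumption.

```python
from typing import Dict, List, Sequence, Set, Tuple


def split_disjoint(workload: Sequence[Tuple[str, Set[str]]]) -> List[Tuple[List[str], Set[str]]]:
    """Group statements into sequences that share no statements and no indexes.

    `workload` lists (statement name, relevant index names) in arrival order.
    Statement order is preserved inside each group. Statements with no relevant
    index are skipped, since no configuration choice depends on them.
    """
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Indexes touched by the same statement belong to the same component.
    for _, idxs in workload:
        ordered = sorted(idxs)
        for idx in ordered:
            find(idx)
        for other in ordered[1:]:
            ra, rb = find(ordered[0]), find(other)
            if ra != rb:
                parent[rb] = ra

    groups: Dict[str, Tuple[List[str], Set[str]]] = {}
    for name, idxs in workload:
        if not idxs:
            continue
        root = find(next(iter(idxs)))
        stmts, used = groups.setdefault(root, ([], set()))
        stmts.append(name)
        used.update(idxs)
    return list(groups.values())


# A seven-statement workload in which each statement touches exactly one index.
w = [("S1", {"I1"}), ("S2", {"I2"}), ("S3", {"I1"}), ("S4", {"I1"}),
     ("S5", {"I2"}), ("S6", {"I2"}), ("S7", {"I3"})]
for stmts, idxs in split_disjoint(w):
    print(stmts, sorted(idxs))
```

Each resulting group can then be solved on its own DAG, whose stages enumerate subsets of that group's indexes only.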

Efficiency Gain with Disjoint Sequences
- Unsplit workload W with candidate indexes {I1, I2, I3}: 8 nodes at each stage.
- After splitting: W1 = [S1, S3, S4] with {I1}, W2 = [S2, S5, S6] with {I2}, and W3 = [S7] with {I3}: 2 nodes at each stage for each sequence.

Merge Solutions of W1, W2, and W3: No Storage Violations
- [Figure] Per-sequence shortest-path DAGs from SRC to DEST: W1 = [S1, S3, S4] over { } and {I1} (create/drop edges I1c, I1d); W2 = [S2, S5, S6] over { } and {I2} (I2c, I2d); W3 = [S7] over { } and {I3} (I3c).
- [Figure] Merged path Pu from SRC to DEST over S1, ..., S7, passing through the configurations {I1}, {I1,I2}, {I2}, { }, and {I3}.
- Pu is optimal when there are no storage violations.

Merge in the Presence of Storage Violations
- Suppose the storage bound allows only 1 index.
- Pu is not a valid solution, because it contains configurations that violate the storage bound (here, {I1,I2}).
- [Figure] Pu compared with Pu', a path over S1, ..., S7 that uses only the configurations { }, {I1}, {I2}, and {I3}.
- Pu' is obtained by merging P1, P2, and P3 so that the result is a valid solution.
- Note that the cost of Pu is a lower bound on the cost of any valid solution.
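
The union-style merge and the storage check can be sketched as follows. The slides do not spell out the merge rule, so this is one plausible reading, stated as an assumption: at every statement of the full workload, the merged configuration is the union of what each sub-solution has in place at that moment (its initial configuration before its first statement, its most recent configuration in between, and its final configuration after its last statement). The per-statement configurations in the example are likewise illustrative guesses, and all names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

Config = FrozenSet[str]
SubSolution = Tuple[Dict[str, Config], Config]  # (configuration per statement, final configuration)


def merge_union(
    order: Sequence[str],
    solutions: Sequence[SubSolution],
    initial: Config = frozenset(),
) -> List[Config]:
    """Union, at each statement of the full workload, of every sub-solution's current configuration."""
    position = {stmt: i for i, stmt in enumerate(order)}
    merged = [set() for _ in order]
    for configs, final in solutions:
        last = max(position[s] for s in configs)   # last statement of this sub-sequence
        current = initial
        for i, stmt in enumerate(order):
            if stmt in configs:
                current = configs[stmt]
            merged[i] |= final if i > last else current
    return [frozenset(c) for c in merged]


def storage_violations(
    order: Sequence[str], merged: Sequence[Config], size: Dict[str, float], bound: float
) -> List[str]:
    """Statements whose merged configuration exceeds the storage bound."""
    return [s for s, c in zip(order, merged) if sum(size[i] for i in c) > bound]


E = frozenset()
order = ["S1", "S2", "S3", "S4", "S5", "S6", "S7"]
p1 = ({"S1": frozenset({"I1"}), "S3": frozenset({"I1"}), "S4": E}, E)
p2 = ({"S2": frozenset({"I2"}), "S5": frozenset({"I2"}), "S6": frozenset({"I2"})}, E)
p3 = ({"S7": frozenset({"I3"})}, E)

pu = merge_union(order, [p1, p2, p3])
sizes = {"I1": 1.0, "I2": 1.0, "I3": 1.0}
print([sorted(c) for c in pu])
print(storage_violations(order, pu, sizes, bound=2.0))  # room for two indexes
print(storage_violations(order, pu, sizes, bound=1.0))  # room for one index
```

When the violation list is empty, the union can be used as is. Otherwise the violating stages are where a repaired merge such as Pu' has to give up an index, and the cost of the union remains useful as a lower bound.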

Solution with Split and Merge
[Flow diagram] Inputs: the sequence with its constraints, and the candidate set of structures.
1. Apply the Split operator to get disjoint sequences.
2. Solve each sequence independently using EXHAUSTIVE or GREEDY-SEQ.
3. Merge the results of the disjoint sequences.
Output: recommendation.

Greedy Approach
- Goal: explore a polynomial number of good configurations.
- Run shortest path over the DAG constructed with these configurations.
- The solution is close to optimal.
- Greedy-SEQ: an adaptation of an existing greedy technique to the sequence model.

Greedy-SEQ
Steps of Greedy-SEQ:
1. Get the optimal solution for each index. Record configurations.
2. Initialize current best to be the lowest-cost solution seen so far.
3. Improve current best by combining it with other solutions and resetting current best. Record new configurations of current best. Repeat until no more improvement.
4. Run shortest path over the configurations collected.
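
The four steps can be sketched in code as follows. This is a loose illustration, not the authors' algorithm: in particular, the way two solutions are "combined" here (at each stage, consider either solution's configuration, their union, or the empty configuration, then take the shortest path) is an assumption, as are all names, the toy cost model, and the simplification that the final configuration is simply left equal to the last one.

```python
from typing import Callable, FrozenSet, List, Sequence, Set, Tuple

Config = FrozenSet[str]
Stmt = Tuple[str, str]
Plan = Tuple[float, List[Config]]
EMPTY: Config = frozenset()


def shortest_path(statements, stages, cost, tc, fits) -> Plan:
    """Cheapest path choosing one feasible configuration per stage, starting from { }."""
    best = {EMPTY: (0.0, [])}
    for stmt, candidates in zip(statements, stages):
        nxt = {}
        for c in candidates:
            if fits(c):
                nxt[c] = min(
                    ((pc + tc(p, c) + cost(stmt, c), path + [c]) for p, (pc, path) in best.items()),
                    key=lambda t: t[0],
                )
        best = nxt
    return min(best.values(), key=lambda t: t[0])


def greedy_seq(statements, indexes, cost, tc, fits) -> Plan:
    """Assumes at least one candidate index and that the empty configuration fits."""
    n = len(statements)
    collected: List[Set[Config]] = [{EMPTY} for _ in range(n)]

    def record(path: List[Config]) -> None:
        for stage, config in zip(collected, path):
            stage.add(config)

    # Step 1: optimal solution for each index on its own; record its configurations.
    solutions = [shortest_path(statements, [{EMPTY, frozenset({i})}] * n, cost, tc, fits)
                 for i in indexes]
    for _, path in solutions:
        record(path)
    # Step 2: current best is the lowest-cost solution seen so far.
    best = min(solutions, key=lambda t: t[0])
    # Step 3: keep combining current best with other solutions while it improves.
    improved = True
    while improved:
        improved = False
        for _, other in solutions:
            stages = [{a, b, a | b, EMPTY} for a, b in zip(best[1], other)]
            candidate = shortest_path(statements, stages, cost, tc, fits)
            if candidate[0] < best[0]:
                best, improved = candidate, True
                record(best[1])
    # Step 4: shortest path over all configurations collected.
    return shortest_path(statements, collected, cost, tc, fits)


# Hypothetical toy model: queries need their helpful index; updates pay per index.
def toy_cost(stmt: Stmt, config: Config) -> float:
    kind, helpful = stmt
    if kind == "query":
        return 1.0 if helpful in config else 10.0
    return 2.0 + 3.0 * len(config)


def toy_tc(old: Config, new: Config) -> float:
    return 3.0 * len(new - old) + 1.0 * len(old - new)


workload = [("query", "I1"), ("query", "I2"), ("update", "")]
print(greedy_seq(workload, ["I1", "I2"], toy_cost, toy_tc, fits=lambda c: len(c) <= 1))
```

Each call to `shortest_path` sees only a handful of configurations per stage, so the DAGs stay small even when the number of candidate indexes is large. The price is that the result is a heuristic rather than a guaranteed optimum.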

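Combining Two Single-Index Solutions
[Figure] The single-index solutions for I1 and I2 over the statement sequence S0, ..., SK, ..., SL, ..., SN, SN+1, together with the per-stage configurations { }, {I1}, {I2}, and {I1,I2} that are considered when the two solutions are combined.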

Greedy-SEQ: Greedy Approach
1. Get the optimal solution for each index. Record configurations.
2. Initialize current best to be the lowest-cost solution seen so far.
3. Improve current best by combining it with other solutions and resetting current best. Record new configurations of current best. Repeat Step 3 until no more improvement.
4. Run shortest path over the configurations collected.

End-to-End Solution
[Flow diagram] Inputs: the sequence with its constraints, and the candidate set of structures.
1. Apply the Split operator to get disjoint sequences.
2. Apply cost-based pruning on each sequence.
3. Solve each sequence independently using EXHAUSTIVE or GREEDY-SEQ.
4. Merge the results of the disjoint sequences.
Output: recommendation.

Sequence vs. Set-Based Approaches
- Values are % improvement relative to the optimal set-based solution, for two storage bounds M.
- Sequence is better in the presence of updates and/or when the storage bound is low.

| Workload | M = 1.2 GB | M = 3 GB |
| --- | --- | --- |
| TPCH-22 | 19% | 0% |
| TPCH-22-I-10-MID | 22% | 16% |
| TPCH-22-I-10-END | 25% | 28% |

Greedy-SEQ vs. Exhaustive
- Greedy-SEQ is much faster, with minimal degradation in quality.

| Workload | % reduction in running time | % reduction in quality |
| --- | --- | --- |
| TPCH-3 | 50% | <1% |
| TPCH-5-M-5 | 98.4% | 2.3% |
| TPCH-22 | Exhaustive was terminated after 24 hours | Not available |

Effectiveness of Split and Merge
- With split and merge (SPMR) vs. without (WO-SPMR).

| Workload | % reduction in running time compared to WO-SPMR | % reduction in quality compared to WO-SPMR |
| --- | --- | --- |
| TPCH-22 | <0.1% | 0% |
| WKLD1 | 89.9% | |
| WKLD1-LOW | 71.4% | 3.0% |

Conclusion
- The sequence model allows more optimization opportunities than the set model.
- The problem is modeled as finding the shortest path over a DAG.
- Heuristics give nearly optimal solutions with much better performance.