Automated Selection of Materialized Views and Indexes for SQL Databases SANJAY AGRAWAL SURAJIT CHAUDHURI VIVEK NARASAYYA HASAN KUMAR REDDY A (09005065)

Slides:



Advertisements
Similar presentations
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Slide
Advertisements

On-line Index Selection for Physical Database Tuning
Sanjay Agrawal Microsoft Research Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research DBXplorer: A System for Keyword Based Search over.
Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
Optimizing Join Enumeration in Transformation-based Query Optimizers ANIL SHANBHAG, S. SUDARSHAN IIT BOMBAY VLDB 2014
Query Optimization of Frequent Itemset Mining on Multiple Databases Mining on Multiple Databases David Fuhry Department of Computer Science Kent State.
Using the Optimizer to Generate an Effective Regression Suite: A First Step Murali M. Krishna Presented by Harumi Kuno HP.
Key-word Driven Automation Framework Shiva Kumar Soumya Dalvi May 25, 2007.
Linked Bernoulli Synopses Sampling Along Foreign Keys Rainer Gemulla, Philipp Rösch, Wolfgang Lehner Technische Universität Dresden Faculty of Computer.
Fundamentals of Python: From First Programs Through Data Structures
Brian Babcock Surajit Chaudhuri Gautam Das at the 2003 ACM SIGMOD International Conference By Shashank Kamble Gnanoba.
Case Tools Trisha Cummings. Our Definition of CASE  CASE is the use of computer-based support in the software development process.  A CASE tool is a.
IBM Software Group ® Recommending Materialized Views and Indexes with the IBM DB2 Design Advisor (Automating Physical Database Design) Jarek Gryz.
Outline SQL Server Optimizer  Enumeration architecture  Search space: flexibility/extensibility  Cost and statistics Automatic Physical Tuning  Database.
An Efficient Cost-Driven Selection Tool for Microsoft SQL Server Surajit ChaudhuriVivek Narasayya Indian Institute of Technology Bombay CS632 Course seminar.
CoPhy: A Scalable, Portable, and Interactive Index Advisor for Large Workloads Debabrata Dash, Anastasia Ailamaki, Neoklis Polyzotis 1.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
Paper by: A. Balmin, T. Eliaz, J. Hornibrook, L. Lim, G. M. Lohman, D. Simmen, M. Wang, C. Zhang Slides and Presentation By: Justin Weaver.
1 Primitives for Workload Summarization and Implications for SQL Prasanna Ganesan* Stanford University Surajit Chaudhuri Vivek Narasayya Microsoft Research.
Self-Tuning and Self-Configuring Systems Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems March 16, 2005.
1 Physical Database Design and Tuning Module 5, Lecture 3.
Chapter 6: Database Evolution Title: AutoAdmin “What-if” Index Analysis Utility Authors: Surajit Chaudhuri, Vivek Narasayya ACM SIGMOD 1998.
Data Structures Introduction Phil Tayco Slide version 1.0 Jan 26, 2015.
CS 345: Topics in Data Warehousing Thursday, November 4, 2004.
Query Optimization Allison Griffin. Importance of Optimization Time is money Queries are faster Helps everyone who uses the server Solution to speed lies.
Practical Database Design and Tuning. Outline  Practical Database Design and Tuning Physical Database Design in Relational Databases An Overview of Database.
CSC271 Database Systems Lecture # 30.
CS 345: Topics in Data Warehousing Tuesday, October 19, 2004.
Case Base Maintenance(CBM) Fabiana Prabhakar CSE 435 November 6, 2006.
1 Chapter 24 Developing Efficient Algorithms. 2 Executing Time Suppose two algorithms perform the same task such as search (linear search vs. binary search)
DBXplorer: A System for Keyword- Based Search over Relational Databases Sanjay Agrawal Surajit Chaudhuri Gautam Das Presented by Bhushan Pachpande.
Physical Database Design & Performance. Optimizing for Query Performance For DBs with high retrieval traffic as compared to maintenance traffic, optimizing.
Probabilistic Ranking of Database Query Results Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International.
DBXplorer: A System for Keyword- Based Search over Relational Databases Sanjay Agrawal, Surajit Chaudhuri, Gautam Das Cathy Wang
Budget-based Control for Interactive Services with Partial Execution 1 Yuxiong He, Zihao Ye, Qiang Fu, Sameh Elnikety Microsoft Research.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
Disclosure risk when responding to queries with deterministic guarantees Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University.
Database Design and Management CPTG /23/2015Chapter 12 of 38 Functions of a Database Store data Store data School: student records, class schedules,
Views Lesson 7.
CERN – European Organization for Nuclear Research Administrative Support - Internet Development Services CET and the quest for optimal implementation and.
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented by Sushanth.
Index Interactions in Physical Design Tuning Modeling, Analysis, and Applications Karl Schnaitter, UC Santa Cruz Neoklis Polyzotis, UC Santa Cruz Lise.
1 Chapter 10 Joins and Subqueries. 2 Joins & Subqueries Joins – Methods to combine data from multiple tables – Optimizer information can be limited based.
The Volcano Optimizer Generator Extensibility and Efficient Search.
Efficient RDF Storage and Retrieval in Jena2 Written by: Kevin Wilkinson, Craig Sayers, Harumi Kuno, Dave Reynolds Presented by: Umer Fareed 파리드.
1 Biometric Databases. 2 Overview Problems associated with Biometric databases Some practical solutions Some existing DBMS.
To Tune or not to Tune? A Lightweight Physical Design Alerter Nico Bruno, Surajit Chaudhuri DMX Group, Microsoft Research VLDB’06.
Buffer-pool aware Query Optimization Ravishankar Ramamurthy David DeWitt University of Wisconsin, Madison.
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
Materialized View Selection and Maintenance using Multi-Query Optimization Hoshi Mistry Prasan Roy S. Sudarshan Krithi Ramamritham.
SQL/Lesson 7/Slide 1 of 32 Implementing Indexes Objectives In this lesson, you will learn to: * Create a clustered index * Create a nonclustered index.
Presented By Anirban Maiti Chandrashekar Vijayarenu
Session 1 Module 1: Introduction to Data Integrity
Ranking of Database Query Results Nitesh Maan, Arujn Saraswat, Nishant Kapoor.
Creating Indexes on Tables An index provides quick access to data in a table, based on the values in specified columns. A table can have more than one.
Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International University Gerhard Weikum, MPI Informatik.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.
Data Integrity & Indexes / Session 1/ 1 of 37 Session 1 Module 1: Introduction to Data Integrity Module 2: Introduction to Indexes.
SQL IMPLEMENTATION & ADMINISTRATION Indexing & Views.
Practical Database Design and Tuning
An Efficient, Cost-Driven Index Selection Tool for MS-SQL Server
Parallel Databases.
Evaluation of Relational Operations
Chapter 15 QUERY EXECUTION.
Automatic Physical Design Tuning: Workload as a Sequence
Practical Database Design and Tuning
Recommending Materialized Views and Indexes with the IBM DB2 Design Advisor (Automating Physical Database Design) Jarek Gryz.
Overview of Query Evaluation
A Framework for Testing Query Transformation Rules
Presentation transcript:

Automated Selection of Materialized Views and Indexes for SQL Databases SANJAY AGRAWAL SURAJIT CHAUDHURI VIVEK NARASAYYA HASAN KUMAR REDDY A ( ) 1

Outline  Motivation  Introduction  Architecture  Algorithm  Candidate Selection  Configuration Enumeration  Cost Estimation  Conclusion 2

Review : Materialized View  View: result of a stored query - logical  Materialized View: Physically stores the query result Additionally: can be indexed !  Any change in underlying tables may give rise to change in materialized view – immediate, deferred  pros: Improve performance of queries  cons: Increase in size of database, update overhead, asynchrony 3

Motivation  DBA have to administer manually – create indexes, materialized views, indexes on materialized views for performance tuning  Time consuming  prone to errors  Might not be able to handle continuously changing workload  Hence we need automatic DB tuning tools 4

Introduction  Both indexes and materialized views are physical structures that can accelerate performance  A materialized view - richer in structure - defined over multiple tables, and can have selections and GROUP BY over multiple columns.  An index can logically be considered as a special case of single table projection only materialized view 5

Problem  Determine an appropriate set of indexes, materialized views  Identifying a set of traditional indexes, materialized views and indexes on materialized views for the given workload that are worthy of further exploration  Picking attractive set from all potentially interesting mv set is practically not feasible 6

Architecture for Index and Materialized View Selection 7

Source: Automated Selection of Materialized Views and Indexes for SQL Databases [1] 8

Architecture for Index and Materialized View Selection  Assuming we are given a representative workload  Key components of Architecture  Syntactic structure selection  Candidate selection  Configuration enumeration  Configuration simulation and cost estimation 9

Architecture: syntactic structure selection  Identify syntactically relevant indexes, mv and indexes on mv  query Q : SELECT Sum(Sales) FROM Sales_Data WHERE City = 'Seattle’  Syntactically relevant materialized views (e.g.)  v1 : SELECT Sum(Sales) FROM Sales_Data WHERE City =‘Seattle’ 10

Architecture: syntactic structure selection  Identify syntactically relevant indexes, mv and indexes on mv  query Q : SELECT Sum(Sales) FROM Sales_Data WHERE City = 'Seattle’  Syntactically relevant materialized views (e.g.)  v2 : SELECT City, Sum(Sales) FROM Sales_Data GROUP BY City 11

Architecture: syntactic structure selection  Identify syntactically relevant indexes, mv and indexes on mv  query Q : SELECT Sum(Sales) FROM Sales_Data WHERE City = 'Seattle’  Syntactically relevant materialized views (e.g.)  v3 : SELECT City, Product, Sum(Sales) FROM Sales_Data GROUP BY City, Product 12

Architecture: syntactic structure selection  Identify syntactically relevant indexes, mv and indexes on mv  query Q : SELECT Sum(Sales) FROM Sales_Data WHERE City = 'Seattle’  Syntactically relevant materialized views (e.g.)  Optionally, we can consider additional indexes on the columns of the materialized view. 13

Architecture: Candidate Selection  Identifying a set of traditional indexes, materialized views and indexes on materialized views for the given workload which are worthy of further exploration  Note:  This paper focuses only on efficient selection of candidate materialized views  Assumes that candidate indexes have already been picked 14

Architecture : Configuration Enumeration  Determine ideal physical design – configuration  Search through the space in a naïve fashion is infeasible  over joint space of indexes and materialized views  Note:  Paper doesn’t discuss issues related to selection of indexes on materialized views 15

Greedy(m,k) algorithm for indexes Source: An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server [3] 16 Extra: Ref - 3

Greedy(m,k) algorithm  Returns a configuration consisting of a total of k indexes and materialized views.  It first picks an optimal configuration of size up to m (≤ k) by exhaustively enumerating all configurations of size up to m. (seed)  It then picks the remaining (k-m) structures greedily or until no further cost reduction in cost is possible by adding a structure  Configuration simulation and cost estimation module is responsible for evaluation cost of configurations 17 Extra: Ref - 3

Architecture: Configuration simulation and cost estimation  Supports other modules by providing cost estimation  Extension to database optimizer - what-if - simulate presence of mat. views and indexes that do not exist to query optimizer - cost(Q,C) - cost of query Q when the physical design is configuration C - optimizer costing module 18

“What-If” Indexes and Materialized Views  Provides interface to propose hypothetical (‘what- if’) indexes and mat views and quantitatively analyze their impact on performance  Note: Details and implantation on reference[2] 19 Extra: Ref – 2

Cost Evaluation (only indexes)  Naïve approach to evaluating a configuration: cost-evaluator asks the optimizer for a cost estimate for each query in workload  Idea: cost of query Q in config C 2 may be same as in config C 1 which was computed earlier  No need to recompute cost of Q in C 2  How to identify such situations?  A configuration C is atomic if for some query there is a possible execution that uses all indexes and mat views in C  Sufficient to evaluate only M’ configurations among M as long as all atomic configurations are included in M’ (Identifying M’ – Reference[3]) 20 Extra: Ref - 3

Cost of Configuration from Atomic Configurations  C : non-atomic configuration Q : a select/update query C i : atomic configuration of Q & C i is subset of C  Cost (Q,C) = Min i {Cost (Q, C i )} without invoking optimizer for C, if Cost(Q, C i ) already computed  For a select query inclusion of index can only reduce the cost   min cost over largest atomic configurations of Q 21 Extra: Ref - 3

Cost of Configuration from Atomic Configurations  Q : a insert/delete query  Cost of Q for non-atomic configuration C a) Cost of Selection b) Cost of updating table and indexes that may be used for selection c) Cost of updating indexes that don’t effect selection cost ( independent of each other and plan chosen for a & b )  Total cost = T + ∑ j (Cost(Q, {I j }) – Cost(Q, {})) 22 Extra: Ref - 3

Candidate Index Selection Source: An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server [3] 23 Extra: Ref - 3

Candidate Materialized View Selection  Given table-subset, no. of mat views arising from selection conditions and group by columns in the query is large  Goal: Quickly eliminate mat views that are syntactically relevant but are never used in answering any query  Obvious approach: Selecting one candidate materialized view per query that exactly matches each query in the workload  does not work since in many database systems the language of materialized views may not match the language of queries 24

Candidate selection: storage- constrained environments  Example 1: Consider a workload consisting of 1000 queries of the form: SELECT l_returnflag, l_linestatus, SUM(l_quantity) FROM lineitem WHERE l_shipdate BETWEEN and GROUP BY l_returnflag, l_linestatus Assume different constants for and  Alternate mv: SELECT l_shipdate, l_returnflag, l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY l_shipdate, l_returnflag, l_linestatus 25

Candidate selection: table- subsets with negligible impact  Table-subsets occur infrequently in the workload or they occur only in inexpensive queries  Example 2 Consider a workload of 100 queries whose total cost is 10,000 units. Let T be a table-subset that occurs in 25 queries whose combined cost is 50 units. Then even if we considered all syntactically relevant materialized views on T, the maximum possible benefit of those materialized views for the workload is 0.5% 26

Candidate selection: size of materialized view  Consider the TPC-H 1GB database and the workload specified in the benchmark.  There are several queries in which the tables lineitem, orders, nation, and region co-occur.  However, it is likely that materialized views proposed on the table-subset {lineitem, orders} are more useful than materialized views proposed on {nation, region}.  This is because the tables lineitem and orders have 6 million and 1.5 million rows respectively, but tables nation and region are very small (25 and 5 rows respectively).  The benefit of pre-computing the portion of the queries involving {nation, region} is insignificant compared to the benefit of pre-computing the portion of the query involving {lineitem, orders}. 27

Candidate materialized view selection 1. Arrive at a smaller set of interesting table-subsets 2. Propose a set of mat views for each query We select a configuration that is best for that query, using cost-based analysis 3. Generate a set of merged mat views  merged mat views U mat views(2) enters configuration enumeration. 28

Finding Interesting Table-Subsets 29

Finding Interesting Table-Subsets 30

Finding Interesting Table-Subsets Source: Automated Selection of Materialized Views and Indexes for SQL Databases [1] 31

Syntactically Relevant Materialized Views  Mat views only on the table-subset that exactly matches the tables references in Q i – insufficient  Language of mat views  Algebraic transformation of queries by optimizer  Exact match might not even be deemed interesting  Consider smaller interesting table subsets  On all interesting table subsets that occur in Q i - Effective pruning 32

Syntactically Relevant Materialized Views  For each interesting table-subset T  A ‘pure-join’ mat view on T containing join and selection conditions in Q i on tables in T  If Q i has grouping columns, also include GROUP BY columns and aggregate expressions from Q i on tables in T  May also include only a subset of selection conditions  But that is considered during view merging step  For each mat view, we propose a set of clustered and non-clustered indexes (details excluded) 33

Exploiting Query Optimizer to Prune Syntactically Relevant Materialized Views  Still many mat views may not be used in answering any query – decision made by query optimizer based on cost estimation  Intuition: If mat view is not part of best solution of any query in workload, its unlikely to be part of the best solution for entire workload 34

Exploiting Query Optimizer to Prune Syntactically Relevant Materialized Views  For a given query Q, and a set S of materialized views (and indexes on them) proposed for Q, function Find-Best-Configuration(Q, S) returns the best configuration for Q from S  Cost-based - by optimizer  Any suitable search e.g. Greedy(m,k) can be used 35

Exploiting Query Optimizer to Prune Syntactically Relevant Materialized Views Source: Automated Selection of Materialized Views and Indexes for SQL Databases [1] What if updates or storage constrains are present? 36

View Merging  Under storage constraints, sub-optimal recommendations (see example 1) SELECT l_returnflag, l_linestatus, SUM(l_quantity) FROM lineitem WHERE l_shipdate BETWEEN and GROUP BY l_returnflag, l_linestatus SELECT l_shipdate, l_returnflag, l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY l_shipdate, l_returnflag, l_linestatus  Directly analyzing multiple queries at once – infeasible  Observation: Set of mat views returned, M are selected on cost-basis and are vey likely to be used by optimizer  Additional ‘merged’ mat views are derived from M 37

Merging a pair of views  Sequence of pair-wise merges  Parent view pair is merged to generate merged view  Criteria for merging  All queries that can be answered using either of parent views should be answerable using merged view  Cost of answering queries using merged view should not be significantly higher than the cost of answering queries using views in M 38

View Merging – criteria  All queries that can be answered using either of parent views should be answerable using merged view  Syntactically modifying parent views as little as possible Retain common aspect and generalize only differences  Cost of answering queries using merged view should not be significantly higher than the cost of answering queries using views in M  Prevent a merged view(v) from being generated if it is much larger that views in Parent-Closure(v) – set of views in M from which v is derived 39

Merging Pair of Views Source: Automated Selection of Materialized Views and Indexes for SQL Databases [1] 40

Algorithm for generating merged views  Merging merged mat views with mv from M or another merged mv  Much fewer than exponential of M merged mvs are explored – checks built into step 4  Set of merged mvs does not depend on sequence of merges 41

Algorithm for generating merged views Source: Automated Selection of Materialized Views and Indexes for SQL Databases [1] 42

Key Techniques  How to identify interesting set of tables such that we need to consider materialized views only over such set of tables  Finding relevant materialized views and further pruning of them based on cost  View merging technique that identifies candidate materialized views that while not optimal for any single query can be beneficial to multiple queries in the workload 43

Role of interaction between indexes and materialized views  Together significantly improve performance  Impact more in presence of storage constraints and updates  Both are redundant structures that speed up query execution, compete for same resources – storage and incur maintenance overhead in presence of updates  Interact – presence of one may make other more attractive  Our approach - joint enumeration of space of candidate indexes and materialized views 44

Joint enumeration vs alternatives  Two alternatives to JOINTSEL 1. MVFIRST – mv first and then ind 2. INDFIRST – indexes first and then mv  Global storage bound is S, fraction f (0 ≤ f ≤ 1) a storage constraint f*S is applied to selection of first feature  f depends on several attributes of workload amount of updates, complexity of queries, absolute value of total storage bound  Even at optimal f, JOINTSEL is better in most cases  MVFIRST – adversely affect quality of indexes picked 45

Joint enumeration vs alternatives  JOINTSEL - two attractions 1. A graceful adjustment to storage bounds 2. Considering interactions between candidate indexes and candidate materialized views  e.g. a query Q for which I 1 and I 2, v are candidates Assume I 1 alone reduces cost of Q by 25 units and I 2 by 30 units, but I 1 and v together reduce cost by 100 units. Using INDFIRST I 2 would eliminate I 1 resulting in suboptimal recommendation  Greedy(m,k) – treats indexes, materialized views and indexes on materialized views on same footing 46

Experiments  Algorithms presented in this paper are implemented on Microsoft SQL Server 2000 Hypotheses set:  Selecting Candidate mat views  Identifying interesting table-subsets substantially reduces mat views without eliminating useful ones  View-merging algorithms significantly improves quality, especially under storage constraints  Architectural issues  Candidate selection module reduces runtime maintaining quality recommendations  Configuration enumeration module Greedy(m.k) gives results comparable to exhaustive algorithm  JOINTSEL better than MVFIRST or INDFIRST 47

Identifying interesting table-subsets 48 Threshold C = 10% Significant Pruning of space With small drop in quality

View Merging 49 With increase in storage, both converge Add. Merged views: 19% Increase in runtime: 9%

Candidate Selection 50 No. of mat views grows linearly with workload size – hence scalable

Candidate Selection 51 candidate selection not only reduces the running time by several orders of magnitude, but the drop in quality resulting from this pruning is very small

Configuration Enumeration 52 m =2 Greedy(m,k) gives a solution comparable in quality to exhaustive enumeration. Yet, in time magnitudes faster

JOINTSEL vs MVFIRST vs INDFIRST 53 Even with no storage constraints JOINSEL is significantly better than MVFIRST or INDFIRST

JOINTSEL vs MVFIRST vs INDFIRST 54 For a given database and workload optimal partitioning varies with storage constraint

JOINTSEL vs MVFIRST vs INDFIRST 55 Best allocation fraction different for each workload Consistent high quality of JOINTSEL, though the runtimes are comparable (+/- 10%)

Conclusions  Key take away from paper would be theoretical framework and appropriate abstractions from physical database design that is able to capture its complexities and compare properties of alternate algorithms  Though many assumptions are made while formulating, they are all later supported by experiments 56

References 1. Automated Selection of Materialized Views and Indexes for SQL Databases. Surajit Chaudhuri, Vivek Narasayya, and Sanjay Agrawal., VLDB AutoAdmin “What-If” Index Analysis Utility. Chaudhuri S., Narasayya V., ACM SIGMOD An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server. Chaudhuri S., Narasayya V., VLDB

Thank you 58