CSE544 Database Statistics
Tuesday, February 15th, 2011
Dan Suciu -- 544, Winter 2011



Outline
Chapter 15 in the textbook

Query Optimization
Three major components:
1. Search space
2. Algorithm for enumerating query plans
3. Cardinality and cost estimation

Cardinality and Cost Estimation
Collect statistical summaries of stored data
Estimate size (= cardinality) in a bottom-up fashion
  – This is the most difficult part, and still inadequate in today's query optimizers
Estimate cost by using the estimated size
  – Hand-written formulas, similar to those we used for computing the cost of each physical operator

Statistics on Base Data
Collected information for each relation
  – Number of tuples (cardinality)
  – Indexes, number of keys in the index
  – Number of physical pages, clustering info
  – Statistical information on attributes
      Min value, max value, number of distinct values
      Histograms
  – Correlations between columns (hard)
Collection approach: periodic, using sampling

Size Estimation Problem
S = SELECT list FROM R1, …, Rn WHERE cond1 AND cond2 AND ... AND condk
Given T(R1), T(R2), …, T(Rn), estimate T(S)
How can we do this?
Note: doesn't have to be exact.

Size Estimation Problem
S = SELECT list FROM R1, …, Rn WHERE cond1 AND cond2 AND ... AND condk
Remark: T(S) ≤ T(R1) × T(R2) × … × T(Rn)

Selectivity Factor
Each condition cond reduces the size by some factor, called the selectivity factor
Assuming independence, multiply the selectivity factors

Example
R(A,B), S(B,C), T(C,D)
SELECT * FROM R, S, T WHERE R.B=S.B and S.C=T.C and R.A<40
T(R) = 30k, T(S) = 200k, T(T) = 10k
Selectivity of R.B = S.B is 1/3
Selectivity of S.C = T.C is 1/10
Selectivity of R.A < 40 is 1/2
What is the estimated size of the query output?
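One way to work the example: start from the cross-product bound T(R) × T(S) × T(T) and multiply by each selectivity factor, assuming the three conditions are independent. The sketch below only restates that arithmetic; the function name `estimate_size` is an illustrative assumption, not something from the slides.

```python
from fractions import Fraction
from math import prod

def estimate_size(cardinalities, selectivities):
    """Cross-product size times the product of the selectivity factors.

    Assumes the conditions are independent, as the slides do.
    """
    return prod(cardinalities) * prod(selectivities)

# Numbers taken from the example: T(R) = 30k, T(S) = 200k, T(T) = 10k.
cards = [30_000, 200_000, 10_000]
# Selectivities of R.B = S.B, S.C = T.C, and R.A < 40.
sels = [Fraction(1, 3), Fraction(1, 10), Fraction(1, 2)]

estimate = estimate_size(cards, sels)
print(estimate)  # 6 * 10^13 / 60 = 10^12
```

Exact fractions are used here so that the product 1/3 × 1/10 × 1/2 = 1/60 introduces no rounding.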

Rule of Thumb
If selectivities are unknown, then: selectivity factor = 1/10 [System R, 1979]

Using Data Statistics
Condition is A = c /* value selection on R */
  – Selectivity = 1 / V(R,A)
Condition is A < c /* range selection on R */
  – Selectivity = (c - Low(R,A)) / (High(R,A) - Low(R,A))
Condition is A = B /* R ⋈A=B S */
  – Selectivity = 1 / max(V(R,A), V(S,B))
  – (will explain next)
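The three formulas can be sketched as small functions. The function and parameter names are assumptions chosen for illustration; `v_r_a` stands for V(R,A), `low` and `high` for Low(R,A) and High(R,A). Clamping the range selectivity to [0, 1] is an addition not stated on the slide, so that constants outside the column's range do not give nonsensical factors.

```python
def eq_const_selectivity(v_r_a):
    """Selectivity of A = c: one of V(R,A) distinct values, assumed uniform."""
    return 1.0 / v_r_a

def range_selectivity(c, low, high):
    """Selectivity of A < c: fraction of the [low, high] interval below c."""
    fraction = (c - low) / (high - low)
    # Clamp (an added assumption) so c outside [low, high] stays in [0, 1].
    return min(1.0, max(0.0, fraction))

def join_selectivity(v_r_a, v_s_b):
    """Selectivity of the join condition A = B."""
    return 1.0 / max(v_r_a, v_s_b)

# Hypothetical statistics, for illustration only.
print(eq_const_selectivity(50))       # 0.02
print(range_selectivity(40, 0, 80))   # 0.5
print(join_selectivity(100, 200))     # 0.005
```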

Assumptions
Containment of values: if V(R,A) <= V(S,B), then the set of A values of R is included in the set of B values of S
  – Note: this indeed holds when A is a foreign key in R, and B is a key in S
Preservation of values: for any other attribute C, V(R ⋈A=B S, C) = V(R, C) (or V(S, C))

Selectivity of R ⋈A=B S
Assume V(R,A) <= V(S,B)
Each tuple t in R joins with T(S) / V(S,B) tuple(s) in S
Hence T(R ⋈A=B S) = T(R) T(S) / V(S,B)
In general: T(R ⋈A=B S) = T(R) T(S) / max(V(R,A), V(S,B))

Size Estimation for Join
Example: T(R) = 10000, T(S) = … (value lost in the transcript), V(R,A) = 100, V(S,B) = 200
How large is R ⋈A=B S?
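A minimal sketch of the general join-size formula T(R) T(S) / max(V(R,A), V(S,B)). The transcript does not preserve the value of T(S), so the sketch uses a hypothetical T(S) = 20000 purely to make the arithmetic concrete; the function name is also an assumption.

```python
def estimate_join_size(t_r, t_s, v_r_a, v_s_b):
    """Estimated size of the equi-join on A = B.

    Relies on the containment-of-values assumption from the slides.
    """
    return t_r * t_s / max(v_r_a, v_s_b)

T_R = 10_000
T_S = 20_000  # hypothetical: the transcript omits this value
V_R_A = 100
V_S_B = 200

size = estimate_join_size(T_R, T_S, V_R_A, V_S_B)
print(size)  # 10000 * 20000 / 200
```

With these inputs each tuple of R is estimated to join with T(S) / V(S,B) = 100 tuples of S.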

Histograms
Statistics on data maintained by the RDBMS
Makes size estimation much more accurate (hence, cost estimations are more accurate)

Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Employee, age) = 50
min(age) = 19, max(age) = 68
σ age=48 (Employee) = ?
σ age>28 and age<35 (Employee) = ?

Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Employee, age) = 50
min(age) = 19, max(age) = 68
σ age=48 (Employee) = ?  Estimate = 25000 / 50 = 500
σ age>28 and age<35 (Employee) = ?  Estimate = 25000 * 6 / 60 = 2500
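Without a histogram, both estimates assume that tuples are spread uniformly over the age values. The sketch below reproduces the slide's arithmetic; the variable names are assumptions, the 6 is the number of integer ages strictly between 28 and 35, and the divisor 60 is taken from the slide as given.

```python
T_EMPLOYEE = 25_000   # T(Employee)
V_AGE = 50            # V(Employee, age)

# Point query age = 48: every distinct age is assumed equally frequent.
point_estimate = T_EMPLOYEE / V_AGE

# Range query 28 < age < 35: ages 29..34 are 6 integer values.
matching_ages = len(range(29, 35))
RANGE_DIVISOR = 60    # divisor used on the slide
range_estimate = T_EMPLOYEE * matching_ages / RANGE_DIVISOR

print(point_estimate, range_estimate)
```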

Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Employee, age) = 50
min(age) = 19, max(age) = 68
[Histogram table: tuple counts per age bucket, with a last bucket for age > 60; the bucket values are not preserved in the transcript]
σ age=48 (Employee) = ?
σ age>28 and age<35 (Employee) = ?

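Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Employee, age) = 50
min(age) = 19, max(age) = 68
[Histogram table: tuple counts per age bucket, with a last bucket for age > 60; the bucket values are not preserved in the transcript]
σ age=48 (Employee) = ?  Estimate = 1200
σ age>28 and age<35 (Employee) = ?  Estimate = 2*80 + 5*500 = 2660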
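With a histogram, the estimate sums, over each bucket, the number of matching values times that bucket's per-value density. The bucket table did not survive in the transcript, so the buckets below are hypothetical, chosen only so that their per-value densities (80, 500, 1200) match the numbers in the slide's arithmetic. Note that the slide's 2*80 term counts two ages in the lower bucket, which corresponds to the inclusive range 28..34.

```python
# Hypothetical buckets: (lowest age, highest age, tuple count), bounds inclusive.
BUCKETS = [(20, 29, 800), (30, 39, 5000), (40, 49, 12000)]

def estimate_range(buckets, lo, hi):
    """Estimate tuples with lo <= age <= hi, assuming uniformity inside each bucket."""
    total = 0.0
    for b_lo, b_hi, count in buckets:
        per_value = count / (b_hi - b_lo + 1)          # tuples per integer age
        overlap = min(hi, b_hi) - max(lo, b_lo) + 1    # matching ages in this bucket
        if overlap > 0:
            total += overlap * per_value
    return total

print(estimate_range(BUCKETS, 48, 48))  # point query age = 48
print(estimate_range(BUCKETS, 28, 34))  # 2 ages at 80 each + 5 ages at 500 each
```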


Types of Histograms
How should we determine the bucket boundaries in a histogram?
  – Eq-Width
  – Eq-Depth
  – Compressed
  – V-Optimal histograms

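Histograms
Employee(ssn, name, age)
Eq-width: [table of age buckets and tuple counts; values not preserved in the transcript]
Eq-depth: [table of age buckets and tuple counts; values not preserved in the transcript]
Compressed: store highly frequent values separately: (48, 1900)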
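The difference between the first two bucketing schemes can be sketched as follows: eq-width splits the value range into intervals of equal length, while eq-depth picks boundaries so that each bucket holds roughly the same number of tuples. The function names and the sample ages are hypothetical, and the quantile rule used for eq-depth is one simple choice among several.

```python
def eq_width_boundaries(lo, hi, num_buckets):
    """Boundaries that split [lo, hi] into buckets of equal width."""
    width = (hi - lo) / num_buckets
    return [lo + i * width for i in range(num_buckets + 1)]

def eq_depth_boundaries(values, num_buckets):
    """Boundaries chosen so each bucket holds about len(values) / num_buckets tuples."""
    data = sorted(values)
    n = len(data)
    # Bucket i starts at the value found at position i * n / num_buckets.
    starts = [data[(i * n) // num_buckets] for i in range(num_buckets)]
    return starts + [data[-1]]

# Hypothetical, skewed sample of ages.
ages = [20, 21, 22, 23, 40, 41, 42, 43, 44, 45, 46, 60]

print(eq_width_boundaries(0, 100, 4))
print(eq_depth_boundaries(ages, 3))
```

On skewed data the eq-depth boundaries crowd together where the values are dense, which is what keeps the per-bucket counts balanced.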

Some Definitions [Ioannidis: The history of histograms, 2003]
Relation R, attribute X
Value set of R.X is V = {v_1 < v_2 < … < v_n}
Spread of v_k is s_k = v_{k+1} - v_k
Frequency f_k = # of tuples with R.X = v_k
Area a_k = s_k × f_k

V-Optimal Histograms
Defines bucket boundaries in an optimal way, to minimize the error over all point queries
Computed rather expensively, using dynamic programming
Modern database systems use V-optimal histograms or some variations

V-Optimal Histograms
Formally:
The data distribution of R.X is:
  – T = {(v_1, f_1), …, (v_n, f_n)}
A histogram for R.X with β buckets is:
  – H = {(u_1, e_1), …, (u_β, e_β)}
The estimation at point x is: e(x) = e_i, where u_i ≤ x < u_{i+1}

V-Optimal Histograms
Formally:
The v-optimal histogram with β buckets is the histogram H = {(u_1, e_1), …, (u_β, e_β)} that minimizes:
  Σ_{i=1}^{n} |e(v_i) - f_i|^2
Can be computed using dynamic programming
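A minimal sketch of such a dynamic program, under stated assumptions: the input is the list of frequencies f_1, …, f_n ordered by value, each bucket covers a contiguous run of values, and each bucket's estimate e_i is the mean frequency of its run (the choice that minimizes squared error for fixed boundaries). The function name and the O(n² β) formulation are illustrative, not taken from the slides.

```python
from itertools import accumulate

def v_optimal(freqs, num_buckets):
    """Return (minimum squared error, list of (start, end) index ranges)."""
    n = len(freqs)
    if not 1 <= num_buckets <= n:
        raise ValueError("need 1 <= num_buckets <= len(freqs)")
    # Prefix sums of f and f^2 give the error of any run in O(1).
    ps = [0.0] + list(accumulate(freqs))
    ps2 = [0.0] + list(accumulate(f * f for f in freqs))

    def sse(i, j):
        """Squared error of estimating freqs[i:j] by their mean."""
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    inf = float("inf")
    # best[b][j]: minimum error covering the first j values with b buckets.
    best = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    best[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):      # last bucket is freqs[i:j]
                cost = best[b - 1][i] + sse(i, j)
                if cost < best[b][j]:
                    best[b][j] = cost
                    cut[b][j] = i
    # Walk the recorded cuts backwards to recover the bucket ranges.
    ranges, j = [], n
    for b in range(num_buckets, 0, -1):
        i = cut[b][j]
        ranges.append((i, j))
        j = i
    ranges.reverse()
    return best[num_buckets][n], ranges

# Hypothetical frequencies with three natural plateaus.
error, ranges = v_optimal([10, 10, 10, 100, 100, 1, 1], 3)
print(error, ranges)
```

The triple loop is what makes the computation expensive on columns with many distinct values.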

Difficult Questions on Histograms
Small number of buckets
  – Hundreds, or thousands, but not more
  – WHY?
Not updated during database update, but recomputed periodically
  – WHY?
Multidimensional histograms rarely used
  – WHY?

Summary of Query Optimization
Three parts:
  – search space, algorithms, size/cost estimation
Ideal goal: find optimal plan. But
  – Impossible to estimate accurately
  – Impossible to search the entire space
Goal of today's optimizers:
  – Avoid very bad plans