low level data manipulation

Query Processing
A high-level user query (SQL) enters the query processor (query compiler, plan generator, cost estimator, plan evaluator). The query processor outputs low-level data manipulation commands (the execution plan).

Query Processing Components Query language that is used SQL: “intergalactic dataspeak” Query execution methodology The steps that one goes through in executing high-level (declarative) user queries. Query optimization How do we determine a good execution plan?

What are we trying to do?
Consider the query: “For each project whose budget is greater than $250000 and which employs more than two employees, list the names and titles of employees.”
In SQL:
    SELECT Ename, Title
    FROM Emp, Project, Works
    WHERE Budget > 250000
      AND Emp.Eno = Works.Eno
      AND Project.Pno = Works.Pno
      AND Project.Pno IN (SELECT w.Pno FROM Works w GROUP BY w.Pno HAVING COUNT(*) > 2)
How to execute this query?

A Possible Execution Plan
T1 ← Scan the Project table and select all tuples with Budget > 250000
T2 ← Join T1 with the Works relation
T3 ← Join T2 with the Emp relation
T4 ← Group the tuples of T3 over Pno
Scan the tuples in each group of T4 and, for groups that have more than 2 tuples, project over Ename, Title
Note: overly simplified; we'll detail later.
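The steps T1 to T4 can be traced on in-memory lists. The following is a minimal Python sketch, not how a DBMS evaluates the plan: the function name `run_plan` and the sample rows are hypothetical, and each join is a naive comparison of all pairs.

```python
# Hypothetical toy rows; column names follow the slides.
PROJECT = [{"Pno": "P1", "Budget": 300000}, {"Pno": "P2", "Budget": 100000}]
WORKS = [
    {"Eno": "E1", "Pno": "P1"}, {"Eno": "E2", "Pno": "P1"},
    {"Eno": "E3", "Pno": "P1"}, {"Eno": "E1", "Pno": "P2"},
]
EMP = [
    {"Eno": "E1", "Ename": "A", "Title": "Eng"},
    {"Eno": "E2", "Ename": "B", "Title": "Mgr"},
    {"Eno": "E3", "Ename": "C", "Title": "Prog"},
]

def run_plan(project, works, emp):
    # T1: select Project tuples with Budget > 250000
    t1 = [p for p in project if p["Budget"] > 250000]
    # T2: join T1 with Works on Pno
    t2 = [{**p, **w} for p in t1 for w in works if p["Pno"] == w["Pno"]]
    # T3: join T2 with Emp on Eno
    t3 = [{**t, **e} for t in t2 for e in emp if t["Eno"] == e["Eno"]]
    # T4: group the tuples of T3 over Pno
    t4 = {}
    for t in t3:
        t4.setdefault(t["Pno"], []).append(t)
    # Keep groups with more than 2 tuples and project over Ename, Title
    return [(t["Ename"], t["Title"])
            for group in t4.values() if len(group) > 2
            for t in group]
```

With the sample rows, only project P1 passes the budget filter and has three employees, so all three employees are returned.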

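Pictorial Representation
Operator tree for the plan, from leaves to root: T1 = σ_Budget>250000(Project); T2 = T1 ⋈ Works; T3 = Emp ⋈ T2; T4 = Group by over T3; root Π_Ename,Title.
As a single expression: Π_Ename,Title(Group_Pno,Eno(Emp ⋈ (σ_Budget>250000(Project) ⋈ Works)))
How do we get this plan? How do we execute each of the nodes?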

Query Processing Methodology
SQL queries → Normalization → Analysis → Simplification → Restructuring → Optimization → “optimal” execution plan, with the system catalog consulted along the way.

Query Normalization
- Lexical and syntactic analysis: check validity (similar to compilers); check for attributes and relations; type checking on the qualification
- Put into a (query) normal form
  Conjunctive normal form: (p11 ∨ p12 ∨ … ∨ p1n) ∧ … ∧ (pm1 ∨ pm2 ∨ … ∨ pmn)
  Disjunctive normal form: (p11 ∧ p12 ∧ … ∧ p1n) ∨ … ∨ (pm1 ∧ pm2 ∧ … ∧ pmn)
- ORs are mapped into union; ANDs are mapped into join or selection

Analysis
Refute incorrect queries.
- Type incorrect: attribute or relation names are not defined in the global schema, or operations are applied to attributes of the wrong type
- Semantically incorrect: components do not contribute in any way to the generation of the result
  Only a subset of relational calculus queries can be tested for correctness: those that do not contain disjunction and negation
  To detect: build the query graph (connection graph) and the join graph

Analysis: Example
    SELECT Ename, Resp
    FROM Emp, Works, Project
    WHERE Emp.Eno = Works.Eno
      AND Works.Pno = Project.Pno
      AND Pname = 'CAD/CAM'
      AND Dur > 36
      AND Title = 'Programmer'
Query graph: nodes Emp, Works, Project, and RESULT. Join edges: Emp.Eno=Works.Eno and Works.Pno=Project.Pno. Selection labels: Dur>36 on Works, Title='Programmer' on Emp, Pname='CAD/CAM' on Project. Result edges: Ename (from Emp) and Resp (from Works).
Join graph: nodes Emp, Works, Project with only the two join edges.

Analysis
If the query graph is not connected, the query may be wrong.
    SELECT Ename, Resp
    FROM Emp, Works, Project
    WHERE Emp.Eno = Works.Eno
      AND Pname = 'CAD/CAM'
      AND Dur > 36
      AND Title = 'Programmer'
This query omits the join predicate Works.Pno = Project.Pno, so in its query graph Project (with the selection Pname='CAD/CAM') has no edge to Emp, Works, or RESULT.
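The connectivity test can be sketched as a graph traversal over the join predicates. This is an illustrative Python sketch: the function name `is_connected` and the representation of join predicates as pairs of relation names are assumptions, and it checks only the join graph (it ignores the RESULT node of the query graph).

```python
def is_connected(relations, join_predicates):
    """Return True if every relation is reachable through join predicates.

    relations: list of relation names
    join_predicates: list of (relation_a, relation_b) pairs, one per join condition
    """
    relations = list(relations)
    if not relations:
        return True
    adjacency = {r: set() for r in relations}
    for a, b in join_predicates:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Depth-first traversal starting from an arbitrary relation
    seen = {relations[0]}
    stack = [relations[0]]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(relations)
```

For the query above, `is_connected(["Emp", "Works", "Project"], [("Emp", "Works")])` reports a disconnected graph; adding `("Works", "Project")` makes it connected.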

Simplification
Why simplify? The simpler the query, the easier (and more efficient) it is to execute it.
How? Use transformation rules:
- elimination of redundancy via idempotency rules:
  p1 ∧ ¬(p1) ⇔ false
  p1 ∧ (p1 ∨ p2) ⇔ p1
  p1 ∨ false ⇔ p1
  …
- application of transitivity
- use of integrity rules

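Simplification: Example
    SELECT Title
    FROM Emp
    WHERE Ename = 'J. Doe'
       OR (NOT(Title = 'Programmer')
           AND (Title = 'Programmer' OR Title = 'Elect. Eng.')
           AND NOT(Title = 'Elect. Eng.'))
The parenthesized conjunction can never be true: it requires Title to be Programmer or Elect. Eng. while also being neither. By the rule p1 ∨ false ⇔ p1 the query simplifies to
    SELECT Title
    FROM Emp
    WHERE Ename = 'J. Doe'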
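One way to convince yourself that the qualification in this example reduces to Ename = 'J. Doe' is a brute-force truth table over its three atomic predicates. The Python sketch below uses hypothetical function names and treats each predicate as an independent boolean, which covers more cases than real Title values allow.

```python
from itertools import product

def original(ename_is_doe, is_programmer, is_elect_eng):
    # The WHERE clause of the example, with each atomic predicate as a boolean
    return ename_is_doe or (
        (not is_programmer)
        and (is_programmer or is_elect_eng)
        and (not is_elect_eng)
    )

def simplified(ename_is_doe, is_programmer, is_elect_eng):
    # The candidate simplification: only the Ename test remains
    return ename_is_doe

def equivalent():
    # Compare the two predicates on all 2^3 truth assignments
    return all(original(*values) == simplified(*values)
               for values in product([False, True], repeat=3))
```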

Restructuring
Convert SQL to relational algebra. Make use of query trees.
Example:
    SELECT Ename
    FROM Emp, Works, Project
    WHERE Emp.Eno = Works.Eno
      AND Works.Pno = Project.Pno
      AND Ename <> 'J. Doe'
      AND Pname = 'CAD/CAM'
      AND (Dur = 12 OR Dur = 24)
Query tree, root to leaves: Project Π_Ename; Select σ_Dur=12 ∨ Dur=24, σ_Pname='CAD/CAM', σ_Ename≠'J. Doe'; Join Project ⋈_Pno (Works ⋈_Eno Emp).

How to implement operators
Selection (assume R has n pages)
- Scan without an index: O(n)
- Scan with an index: B+ tree index O(log n); hash index O(1)
Projection
- Without duplicate elimination: O(n)
- With duplicate elimination: sorting-based O(n log n); hash-based O(n+t), where t is the result of the hashing phase

How to implement operators (cont'd)
Join
Nested loop join: R ⋈ S
    foreach tuple r ∈ R do
        foreach tuple s ∈ S do
            if r and s agree on the join attribute then add <r,s> to result
Cost: O(n*m)
Improvements are possible with page-oriented nested loop join and block-oriented nested loop join.
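The tuple-at-a-time nested loop join can be written directly in Python. This is a minimal in-memory sketch: the function name and the convention of passing the join attribute as a key or index into each tuple are assumptions, and it ignores pages and buffering entirely.

```python
def nested_loop_join(r, s, r_key, s_key):
    """Join two lists of tuples (or dicts) by comparing every pair.

    r_key / s_key: index or dict key of the join attribute in each relation.
    """
    result = []
    for r_tuple in r:          # outer relation
        for s_tuple in s:      # inner relation, rescanned for every outer tuple
            if r_tuple[r_key] == s_tuple[s_key]:
                result.append((r_tuple, s_tuple))
    return result
```

Every outer tuple triggers a full scan of the inner relation, which is where the O(n*m) cost comes from.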

How to implement operators (cont'd)
Join
Index nested loop join: R ⋈ S
    foreach tuple r ∈ R do
        use the index on the join attribute to find matching tuples of S
        foreach such tuple s ∈ S do
            add <r,s> to result
Sort-merge join
- Sort R and S on the join attribute
- Merge the sorted relations
Hash join
- Hash R and S using a common hash function
- Within each bucket, find tuples where r=s
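Sort-merge join and hash join can be sketched in the same in-memory style. Note two assumptions: the hash join below is the simple build-and-probe variant (one hash table on R, probed by S) rather than the scheme on the slide that partitions both relations with a common hash function, and the function names and key-passing convention are hypothetical.

```python
from collections import defaultdict

def hash_join(r, s, r_key, s_key):
    # Build phase: bucket the tuples of R by join attribute value
    buckets = defaultdict(list)
    for r_tuple in r:
        buckets[r_tuple[r_key]].append(r_tuple)
    # Probe phase: look up each tuple of S in the matching bucket
    return [(r_tuple, s_tuple)
            for s_tuple in s
            for r_tuple in buckets.get(s_tuple[s_key], [])]

def sort_merge_join(r, s, r_key, s_key):
    r_sorted = sorted(r, key=lambda t: t[r_key])
    s_sorted = sorted(s, key=lambda t: t[s_key])
    result, i, j = [], 0, 0
    while i < len(r_sorted) and j < len(s_sorted):
        r_val, s_val = r_sorted[i][r_key], s_sorted[j][s_key]
        if r_val < s_val:
            i += 1
        elif r_val > s_val:
            j += 1
        else:
            # Find the run of equal join values on each side
            i_end = i
            while i_end < len(r_sorted) and r_sorted[i_end][r_key] == r_val:
                i_end += 1
            j_end = j
            while j_end < len(s_sorted) and s_sorted[j_end][s_key] == r_val:
                j_end += 1
            # Emit every combination within the two runs
            for r_tuple in r_sorted[i:i_end]:
                for s_tuple in s_sorted[j:j_end]:
                    result.append((r_tuple, s_tuple))
            i, j = i_end, j_end
    return result
```

Both functions return the same pairs as a nested loop join, possibly in a different order.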

Index Selection Guidelines
Hash vs. tree index: a hash index on the inner relation is very good for Index Nested Loops. It should be clustered if the join column is not a key for the inner relation and inner tuples need to be retrieved. A clustered B+ tree on the join column(s) is good for Sort-Merge.

Example 1
    SELECT e.Ename, w.Dur
    FROM Emp e, Works w
    WHERE w.Resp = 'Mgr' AND e.Eno = w.Eno
A hash index on w.Resp supports the 'Mgr' selection. A hash index on e.Eno allows us to get matching (inner) Emp tuples for each selected (outer) Works tuple.
What if WHERE included “AND e.Title = 'Programmer'”? We could retrieve Emp tuples using an index on e.Title, then join with the Works tuples satisfying the Resp selection.

Example 2
    SELECT e.Ename, w.Resp
    FROM Emp e, Works w
    WHERE e.Age BETWEEN 45 AND 60
      AND e.Title = 'Programmer'
      AND e.Eno = w.Eno
Clearly, Emp should be the outer relation. This suggests that we build a hash index on w.Eno.
What index should we build on Emp? A B+ tree on e.Age could be used, or an index on e.Title could be used. Only one of these is needed, and which is better depends upon the selectivity of the conditions. As a rule of thumb, equality selections are more selective than range selections.
As both examples indicate, our choice of indexes is guided by the plan(s) that we expect an optimizer to consider for a query. We have to understand optimizers!

Examples of Clustering
    SELECT e.Title FROM Emp e WHERE e.Age > 40
A B+ tree index on e.Age can be used to get qualifying tuples. How selective is the condition? Is the index clustered?

Clustering and Joins
    SELECT e.Ename, p.Pname
    FROM Emp e, Project p
    WHERE p.Budget = 350000 AND e.City = p.City
Clustering is especially important when accessing inner tuples in Index Nested Loop join. We should make the index on e.City clustered.
Suppose that the WHERE clause is instead: WHERE e.Title = 'Programmer' AND e.City = p.City
If many employees are Programmers, Sort-Merge join may be worth considering. A clustered index on p.City would help.
Summary: clustering is useful whenever many tuples are to be retrieved.

Selecting Alternatives
    SELECT Ename
    FROM Emp e, Works w
    WHERE e.Eno = w.Eno AND w.Dur > 37
Strategy 1: Π_Ename(σ_Dur>37 ∧ Emp.Eno=Works.Eno(Emp × Works))
Strategy 2: Π_Ename(Emp ⋈_Eno (σ_Dur>37(Works)))
Strategy 2 is “better” because it avoids the Cartesian product and it selects a subset of Works before joining.
How to determine the “better” alternative?
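A rough way to see the gap between the two strategies is to compare the size of the first intermediate result each one produces. The Python sketch below uses hypothetical function names; the cardinalities and the selectivity of Dur > 37 in the usage note are made-up illustrative numbers, not values from the slides.

```python
def strategy_1_intermediate(card_emp, card_works):
    # Cartesian product: every Emp tuple paired with every Works tuple
    return card_emp * card_works

def strategy_2_intermediate(card_works, sf_dur):
    # Selection on Works first: only the qualifying fraction reaches the join
    return sf_dur * card_works
```

With 400 Emp tuples, 1000 Works tuples, and an assumed selectivity of 0.1 for Dur > 37, Strategy 1 materializes 400000 tuples before filtering, while Strategy 2 passes about 100 Works tuples to the join.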

Query Optimization Issues: Types of Optimizers
“Exhaustive” search
- cost-based
- optimal
- combinatorial complexity in the number of relations
Heuristics
- not optimal
- regroup common sub-expressions
- perform selection and projection as early as possible
- reorder operations to reduce intermediate relation size
- optimize individual operations

Query Optimization Issues – Optimization Granularity Single query at a time cannot use common intermediate results Multiple queries at a time efficient if many similar queries decision space is much larger

Query Optimization Issues: Optimization Timing
Static
- compilation ⇒ optimize prior to the execution
- difficult to estimate the size of the intermediate results ⇒ error propagation
- can amortize over many executions
Dynamic
- run time optimization
- exact information on the intermediate relation sizes
- have to reoptimize for multiple executions
Hybrid
- compile using a static algorithm
- if the error in estimated sizes > threshold, reoptimize at run time

Query Optimization Issues: Statistics
Relation
- cardinality
- size of a tuple
- fraction of tuples participating in a join with another relation
- …
Attribute
- cardinality of domain
- actual number of distinct values
Common assumptions
- independence between different attribute values
- uniform distribution of attribute values within their domain

Query Optimization Components Cost function (in terms of time) I/O cost + CPU cost These might have different weights Can also maximize throughput Solution space The set of equivalent algebra expressions (query trees). Search algorithm How do we move inside the solution space? Exhaustive search, heuristic algorithms (iterative improvement, simulated annealing, genetic,…)

Cost Calculation Cost function takes CPU and I/O processing into account Instruction and I/O path lengths Estimate the cost of executing each node of the query tree Is pipelining used or are temporary relations created? Estimate the size of the result of each node Selectivity of operations – “reduction factor” Error propagation is possible

Intermediate Relation Sizes: Selection
size(R) = card(R) × length(R)
card(σ_F(R)) = SF_σ(F) × card(R)
where
SF_σ(A = value) = 1 / card(Π_A(R))
SF_σ(A > value) = (max(A) - value) / (max(A) - min(A))
SF_σ(A < value) = (value - min(A)) / (max(A) - min(A))
SF_σ(p(Ai) ∧ p(Aj)) = SF_σ(p(Ai)) × SF_σ(p(Aj))
SF_σ(p(Ai) ∨ p(Aj)) = SF_σ(p(Ai)) + SF_σ(p(Aj)) - (SF_σ(p(Ai)) × SF_σ(p(Aj)))
SF_σ(A ∈ {values}) = SF_σ(A = value) × card({values})
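The selectivity factors for selection translate into a handful of one-line functions. This Python sketch assumes the uniform-distribution and independence assumptions stated earlier in the deck; the function and parameter names are hypothetical, and `distinct_values` stands for card(Π_A(R)).

```python
def sf_equals(distinct_values):
    # SF(A = value) = 1 / number of distinct values of A in R
    return 1.0 / distinct_values

def sf_greater(value, min_a, max_a):
    # SF(A > value) = (max(A) - value) / (max(A) - min(A))
    return (max_a - value) / (max_a - min_a)

def sf_less(value, min_a, max_a):
    # SF(A < value) = (value - min(A)) / (max(A) - min(A))
    return (value - min_a) / (max_a - min_a)

def sf_and(sf_i, sf_j):
    # Conjunction: product of the two factors (independence assumed)
    return sf_i * sf_j

def sf_or(sf_i, sf_j):
    # Disjunction: inclusion-exclusion on the two factors
    return sf_i + sf_j - sf_i * sf_j

def sf_in(distinct_values, number_of_values):
    # SF(A in {values}) = SF(A = value) * card({values})
    return sf_equals(distinct_values) * number_of_values

def selection_cardinality(card_r, selectivity):
    # card(sigma_F(R)) = SF(F) * card(R)
    return selectivity * card_r
```

For instance, if Age ranges from 20 to 60, the estimate for Age > 40 is (60 - 40) / (60 - 20) = 0.5, so a selection over 1000 tuples is expected to return about 500.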

Intermediate Relation Sizes
Projection: card(Π_A(R)) = card(R)
Cartesian product: card(R × S) = card(R) × card(S)
Union: upper bound card(R ∪ S) = card(R) + card(S); lower bound card(R ∪ S) = max{card(R), card(S)}
Set difference: upper bound card(R - S) = card(R); lower bound 0

Intermediate Relation Sizes: Join
Special case: A is a key of R and B is a foreign key of S: card(R ⋈_A=B S) = card(S)
More general: card(R ⋈ S) = SF_J × card(R) × card(S)

Search Space Characterized by “equivalent” query plans Equivalence is defined in terms of equivalent query results Equivalent plans are generated by means of algebraic transformation rules The cost of each plan may be different Focus on joins

Search Space: Join Trees
For N relations, there are O(N!) equivalent join trees that can be obtained by applying commutativity and associativity rules.
    SELECT Ename, Resp
    FROM Emp, Works, Project
    WHERE Emp.Eno = Works.Eno AND Works.Pno = Project.Pno
Three of the equivalent join trees: (Emp ⋈ Works) ⋈ Project; (Project ⋈ Works) ⋈ Emp; (Project × Emp) ⋈ Works.
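The factorial growth is easy to observe by enumerating join orders. The Python sketch below lists only left-deep orders (one per permutation of the relations), which already gives N! candidates; bushy trees would add more. The function names and the nested-tuple tree encoding are assumptions made for illustration.

```python
from itertools import permutations

def left_deep_orders(relations):
    # One left-deep join order per permutation of the relation names
    return list(permutations(relations))

def as_tree(order):
    # Encode a left-deep order as nested pairs: ((R1, R2), R3), ...
    tree = order[0]
    for relation in order[1:]:
        tree = (tree, relation)
    return tree
```

For the three relations of the example this yields 3! = 6 orders, including ones such as (Emp, Project, Works) that start with a Cartesian product.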

Transformation Rules
Commutativity of binary operations
  R × S ⇔ S × R
  R ⋈ S ⇔ S ⋈ R
  R ∪ S ⇔ S ∪ R
Associativity of binary operations
  (R × S) × T ⇔ R × (S × T)
  (R ⋈ S) ⋈ T ⇔ R ⋈ (S ⋈ T)
Idempotence of unary operations
  Π_A'(Π_A''(R)) ⇔ Π_A'(R)
  σ_p1(A1)(σ_p2(A2)(R)) ⇔ σ_p1(A1) ∧ p2(A2)(R)
  where R[A] and A' ⊆ A, A'' ⊆ A and A' ⊆ A''

Transformation Rules
Commuting selection with projection
Commuting selection with binary operations
  σ_p(A)(R × S) ⇔ (σ_p(A)(R)) × S
  σ_p(Ai)(R ⋈_(Aj,Bk) S) ⇔ (σ_p(Ai)(R)) ⋈_(Aj,Bk) S
  σ_p(Ai)(R ∪ T) ⇔ σ_p(Ai)(R) ∪ σ_p(Ai)(T)
  where Ai belongs to R and T
Commuting projection with binary operations
  Π_C(R × S) ⇔ Π_A'(R) × Π_B'(S)
  Π_C(R ⋈_(Aj,Bk) S) ⇔ Π_A'(R) ⋈_(Aj,Bk) Π_B'(S)
  Π_C(R ∪ S) ⇔ Π_C(R) ∪ Π_C(S)
  where R[A] and S[B]; C = A' ∪ B' where A' ⊆ A, B' ⊆ B, Aj ∈ A', Bk ∈ B'

Example
Consider the query: find the names of employees other than J. Doe who worked on the CAD/CAM project for either one or two years.
    SELECT Ename
    FROM Project p, Works w, Emp e
    WHERE w.Eno = e.Eno
      AND w.Pno = p.Pno
      AND Ename <> 'J. Doe'
      AND p.Pname = 'CAD/CAM'
      AND (Dur = 12 OR Dur = 24)
Query tree, root to leaves: Project Π_Ename; Select σ_Dur=12 ∨ Dur=24, σ_Pname='CAD/CAM', σ_Ename≠'J. Doe'; Join ⋈, ⋈ over Project, Works, Emp.

Equivalent Query
Query tree, root to leaves: Π_Ename; σ_Pname='CAD/CAM' ∧ (Dur=12 ∨ Dur=24) ∧ Ename<>'J. Doe'; a join (⋈) above a Cartesian product (×) over Works, Project, and Emp.

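Another Equivalent Query
Query tree with selections and projections pushed toward the leaves, root to leaves: Π_Ename; ⋈_Pno; Π_Pno,Ename; ⋈_Eno.
Leaf branches: Π_Pno(σ_Pname='CAD/CAM'(Project)); Π_Pno,Eno(σ_Dur=12 ∨ Dur=24(Works)); Π_Eno,Ename(σ_Ename<>'J. Doe'(Emp)).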

Search Strategy
How to “move” in the search space.
Deterministic
- Start from base relations and build plans by adding one relation at each step
- Dynamic programming: breadth-first
- Greedy: depth-first
Randomized
- Search for optimalities around a particular starting point
- Trade optimization time for execution time
- Better when > 5-6 relations
- Simulated annealing
- Iterative improvement
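The deterministic, dynamic-programming strategy can be sketched for left-deep plans: the best plan for each set of relations is built from the best plan for a set one relation smaller. This Python sketch rests on several assumptions that are not in the slides: the cost of a plan is the sum of its estimated intermediate result cardinalities, cardinalities are estimated as the product of the base cardinalities and the selectivities of all join predicates inside the set, and the names `dp_join_order`, `cards`, and `join_sf` are hypothetical.

```python
from itertools import combinations

def dp_join_order(cards, join_sf):
    """Return (cost, order) for the cheapest left-deep join order.

    cards: {relation: cardinality}
    join_sf: {frozenset({rel_a, rel_b}): join selectivity factor}
    """
    relations = sorted(cards)

    def result_card(subset):
        # Estimated size of joining every relation in the subset
        card = 1.0
        for relation in subset:
            card *= cards[relation]
        for pair, selectivity in join_sf.items():
            if pair <= subset:
                card *= selectivity
        return card

    # best[subset] = (cost, order); a single relation costs nothing to "join"
    best = {frozenset([r]): (0.0, (r,)) for r in relations}
    for size in range(2, len(relations) + 1):
        for combo in combinations(relations, size):
            subset = frozenset(combo)
            candidates = []
            for last in combo:
                # Extend the best plan for the smaller subset with one relation
                prev_cost, prev_order = best[subset - {last}]
                candidates.append((prev_cost + result_card(subset),
                                   prev_order + (last,)))
            best[subset] = min(candidates)
    return best[frozenset(relations)]
```

On the three-relation example, a plan that starts with Emp and Project has no join predicate between them, so its first intermediate result is a full Cartesian product and the search discards it in favor of an order that joins through Works first.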

Search Algorithms
Restrict the search space
- Use heuristics, e.g., perform unary operations before binary operations
- Restrict the shape of the join tree: consider only linear trees, ignore bushy ones
Linear join tree: ((R1 ⋈ R2) ⋈ R3) ⋈ R4
Bushy join tree: (R1 ⋈ R2) ⋈ (R3 ⋈ R4)

Search Strategies
[Figure: join trees explored by a deterministic strategy versus a randomized strategy]

Summary
Declarative SQL queries need to be converted into low-level execution plans. These plans need to be optimized to find the “best” plan.
Optimization involves:
- Search space: identifies the alternative plans and alternative execution algorithms for algebra operators. This is done by means of transformation rules.
- Cost function: calculates the cost of executing each plan (CPU and I/O costs).
- Search algorithm: controls which alternative plans are investigated.